INTRODUCTION


One of the most significant military contributions to psychology 
has been the development of personality inventories. Such inventories 
have proved helpful for the neuropsychiatric screening of adults both 
prior to induction into military service, and after induction. In contrast, 
the use of personality inventories in civilian practice for the past thirty 
years has generally yielded disappointing results.²

The question thus arises: What are the reasons for the superior
showing of the personality inventories in military practice? To throw
some light on this matter, the present authors undertook to review the
available papers on military validation of personality questionnaires.³
It was considered that such a review should (1) summarize the findings
from the military applications of personality inventories; (2) describe
both the legitimate and spurious sources of such military superiority as
is observed; and (3) point out the lessons for civilian work in this field.
The present review is limited to these three purposes. No attempt will
be made, for example, to evaluate the merits of one inventory as against
another.⁴


1 The writers wish to express their gratitude to several colleagues who have critically
read this paper and offered valuable suggestions: Dr. Walter C. Shipley, of Wheaton
College; Drs. Silvan S. Tomkins, Glenn V. Ramsey, and Norman Frederiksen, of Princeton
University; and Dr. William B. Schrader, of the Educational Testing Service,
Princeton, N. J. Any errors that may remain are, of course, our own responsibility.

2 See the reviews by Ellis (12, 13), Gilliland (16), Maller (32), Roback (42), Rosenzweig
(43), Thorpe (57), Traxler (58), Vernon (60), and Wiley and Trimble (67). Kornhauser
(25), questioning 67 well-known psychologists as to how satisfactory they considered
personality inventories for individual classification, found that only 1.5% of his
respondents upheld the inventories as highly satisfactory; 13.5% considered them
moderately satisfactory; while the remainder regarded the inventories as “doubtfully
satisfactory,” “rather unsatisfactory,” or “highly unsatisfactory.”

3 Papers on the type of questionnaire known as the Biographical Data Inventory
(such as that used by Fiske (14)), are not included in the present review. The biographical
inventory generally concentrates on historical data rather than current adjustment
problems, and is usually quite different from the conventional personality or adjustment
inventory.


TYPES OF VALIDATION 


There were, in general, two important methods of estimating the
value of personality inventories in military practice: (1) by use of a
psychiatric criterion, i.e., comparing the inventory scores of “normal”
groups of enlisted men or officers with the scores of others prognosed or
diagnosed as neuropsychiatrically unfit; and (2) by use of performance
measures, i.e., comparing the inventory scores of those who were
successful in training or combat with the scores of those who were failures.
The results from these two methods of validation will be considered 
separately. 
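
The group comparisons just described are usually summarized, in the studies reviewed here, by a critical ratio, a biserial correlation, or the percentages identified at a cutting score. The following is a minimal sketch of those three statistics, not a reconstruction of any investigator's actual computations. The scores are invented for illustration, the function names are hypothetical, and the critical ratio is assumed to be the mean difference divided by its standard error.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev, variance


def critical_ratio(unfit, normal):
    """Difference between group means divided by its standard error."""
    se = sqrt(variance(unfit) / len(unfit) + variance(normal) / len(normal))
    return (mean(unfit) - mean(normal)) / se


def biserial_r(unfit, normal):
    """Biserial correlation between scores and a dichotomized criterion."""
    scores = list(unfit) + list(normal)
    p = len(unfit) / len(scores)  # proportion falling in the unfit group
    q = 1.0 - p
    unit_normal = NormalDist()
    # Ordinate of the unit normal curve at the point cutting off proportion p.
    y = unit_normal.pdf(unit_normal.inv_cdf(p))
    return (mean(unfit) - mean(normal)) / pstdev(scores) * (p * q / y)


def screening_rates(unfit, normal, cutting_score):
    """Share of each group scoring at or above the cutting score."""
    true_pos = sum(s >= cutting_score for s in unfit) / len(unfit)
    false_pos = sum(s >= cutting_score for s in normal) / len(normal)
    return true_pos, false_pos


# Hypothetical inventory scores (higher = more complaints endorsed).
unfit_scores = [14, 25, 12, 30, 19, 21, 10, 24]
normal_scores = [8, 12, 15, 9, 20, 11, 7, 14, 22, 13, 17, 6]

cr = critical_ratio(unfit_scores, normal_scores)
rb = biserial_r(unfit_scores, normal_scores)
hit_rate, false_alarm_rate = screening_rates(unfit_scores, normal_scores, 20)
print(f"critical ratio = {cr:.2f}, biserial r = {rb:.2f}")
print(f"identified {hit_rate:.0%} of unfit, {false_alarm_rate:.0%} of normals")
```

Note that the biserial coefficient assumes a normally distributed trait underlying the dichotomy, and that the percentages at a cutting score depend heavily on where the score is set.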


MILITARY VALIDATION MAKING USE OF PSYCHIATRIC CRITERIA 


A number of military studies were found in which psychiatric criteria
(either prognoses or diagnoses of unfitness) were employed in the validation
of personality inventories. The basic data of these reports are summarized
in Table I. Examination of the last column of Table I will
show that in only a handful of instances were definitely unfavorable or
negative results obtained, while in the overwhelming majority of studies
the instrument in question proved to have some value for screening or
diagnostic purposes. The near-unanimity of favorable results is impressive,
and all the more so when one remembers that psychiatric prognoses
themselves are by no means uniformly reliable and valid.⁵ However,
some of the validity figures are so high, namely .75 for the Bernreuter
Personality Inventory (N = 200), .78 for the Psychoneurotic Inventory
(N = 600), and .80 for the Psychosomatic Inventory (N = 200),⁶ that
they suggest the operation of unusual experimental or statistical factors.
In the sections below, reasons (both spurious and legitimate) for the
superior validity of personality inventories in military practice will be
considered. These reasons are not offered as original with the writers,
nor does the listing of spurious reasons imply that the original investigators
failed to understand the limitations of their findings. It has seemed
worthwhile, however, to assemble the various reasons or considerations
in one place, for convenient reference in the evaluation of research in
this field.

1. Criterion contamination. In several of the studies there was either
some possibility of, or direct evidence of, criterion contamination. That
is to say, the psychiatrists, psychologists, or officers who made the ratings
which served as criteria of the respondents’ maladjustment or failure
knew the respondents’ inventory scores, and probably allowed these
scores to influence their judgments. To the extent that this occurs, the
obtained validity coefficients are, of course, too high.

Thus, in Coville’s study (10) 77% of the psychiatric disenrollees serving as a
criterion group for the Maritime Service Inventory were selected by means of
an “admission examination.” In the main, this admission examination consisted,
first, of the MSI; and second (for those screened by the MSI) of a
neuropsychiatric interview. MSI scores were freely available for reference during the
interview. There would appear to be appreciable likelihood of criterion
contamination here, even though referral for interview (based mainly on MSI
score) was at the rate of 30-40% of the examinees, whereas actual rejection for
“psychiatric and neurological disorders” was less than 1% (23, pp. 102-103,
304). In Gough’s study (17) mention is made of the possibility of criterion
contamination in some of the psychiatric diagnoses. In the investigation by
Miles and others⁷ (34), the psychiatrists making the original diagnoses were
aware of the subjects’ Self-Descriptive Inventory scores; and the drill instructors
who finally judged some of the subjects as unfit were aware of the
psychiatrists’ ratings. Since the drill instructors’ judgments served as the criterion,
it is evident that the possibility of criterion contamination here may have been
significant.


4 With regard to this problem, Wexler, reporting on studies in which different
inventories were employed in identical samples, concludes: “In no instance has it appeared
with clarity and certainty that a particular instrument, a particular content, or a
particular format is decidedly superior to any other instrument, content, or format”
(65, p. 169).

5 Two pertinent studies may be cited. (1) Weinstock and Watson (64) reported on
121 Naval recruits who were allowed to remain in service (“on trial”) despite an adverse
prognosis based on clinical judgment. Of the 121 recruits, only 44 (or 36%) were
discharged for neuropsychiatric reasons during recruit training. This suggests a rather
high “false positive” ratio, which casts doubt on the validity of the clinical prognosis as
a criterion. (2) Varney and Stone (59), attacking this problem from another angle, report
a study involving 813 Maritime Service trainees disenrolled for neuropsychiatric causes.
Of the 813 disenrollees, 247 (or 30%) had passed through a series of psychiatric screening
processes without rejection. Here the proportion of “false negatives” is perhaps
disturbingly high. Unfortunately, there are only relatively few studies which, like the ones
cited, have made use of a follow-up record of actual maladjustment (as distinguished
from a prognosis alone). While the practical usefulness of the psychiatric prognosis is
well substantiated, the use of the prognosis alone, as a criterion in scientific research,
clearly leaves much to be desired.

6 Summaries of these studies are given in the last three entries of Table I.

7 See the 24th entry in Table II.
TABLE I

MILITARY VALIDITY STUDIES OF PERSONALITY INVENTORIES,
MAKING USE OF PSYCHIATRIC CRITERIA


Personal Inventory

Source: Bobbitt & Newman (7)
Group tested: 99 Coast Guard trainees
Criterion: Patterns of hospital complaints of men obtaining “normal” and “abnormal” PI scores
Result: “The patterns of hospital complaints presented by the two groups are quite different, and the differences are in accordance with expectations.”

Source: Cerf (8)
Group tested: 2107 aviation trainees
Criterion: Ratings of satisfactory-unsatisfactory adaptability, made by psychiatric interviewers
Result: The critical ratio of the mean score difference for the satisfactory and the unsatisfactory group was 4.80. Biserial correlation of -.35, significant at .01 level.

Source: Cerf (8)
Group tested: 194 WASPS
Criterion: Ratings of satisfactory-unsatisfactory adaptability, made by psychiatric interviewers
Result: Biserial correlation of -.36 (significant) between inventory scores and criterion ratings.

Source: Heathers (21)
Group tested: 303 AAF enlisted personnel
Criterion: Psychiatric patients vs. non-psychiatric patients
Result: A critical ratio of the mean score difference of 8.7 was found. The biserial correlation between inventory scores and the criterion was .56.

Source: Heathers (21)
Group tested: 202 AAF enlisted returnees
Criterion: “Normals” vs. anxiety-reaction cases
Result: A critical ratio of the mean score difference of 7.3 was found.

Source: Mote, Berry & Graham (38); also Berry, Leavitt, and Mote (6)
Group tested: 491 Naval training recruits
Criterion: “Normals” vs. neuropsychiatric ward cases
Result: The inventory identified 52% of the neuropsychiatric discharges, and included 18% false positives.

Source: Shaffer (47)
Group tested: 85 AAF officers
Criterion: “Normals” vs. anxiety-reaction cases
Result: Biserial correlation coefficients of .43 and .45 were found between inventory scores and the criterion.

Personal Inventory (continued)

Source: Shaffer (47)
Group tested: 302 AAF personnel
Criterion: “Normals” vs. anxiety-reaction cases
Result: Critical ratios (t’s) of the mean score group-differences were 6.75 between “normals” and mild anxiety-reaction cases, and 5.5 between “normals” and severe anxiety-reaction cases.

Source: Shaffer (47)
Group tested: 199 AAF pilots
Criterion: “Normals” vs. anxiety-reaction cases
Result: A critical ratio of the mean score group-difference of 7.16 was found.

Source: Shaffer (47)
Group tested: 154 AAF navigators
Criterion: “Normals” vs. anxiety-reaction cases
Result: A critical ratio of the mean score group-difference of 5.02 was found.

Source: Shaffer (47)
Group tested: 172 AAF bombardiers
Criterion: “Normals” vs. anxiety-reaction cases
Result: A critical ratio of the mean score group-difference of 5.67 was found.

Source: Shaffer (47)
Group tested: 994 AAF personnel
Criterion: “Normals” vs. anxiety-reaction cases
Result: A biserial correlation coefficient of .52 between inventory scores and criterion was obtained.

Source: Shaffer (47)
Group tested: 802 AAF personnel
Criterion: “Normals” vs. anxiety-reaction cases
Result: A biserial correlation coefficient of .45 between inventory scores and criterion was obtained.

Source: Shaffer (47)
Group tested: 859 AAF personnel
Criterion: “Normals” vs. anxiety-reaction cases
Result: “If a cut-off score of 10 or less were used, 49% of the anxiety-reaction cases but only 12% of the normals would be included. The separation is not quite so effective at the upper end of the distribution.”

Source: Shaffer (47)
Group tested: 2720 AAF returnee officers
Criterion: “Normals” vs. anxiety-reaction cases
Result: “Item analysis ... showed that a large proportion (31 out of 45) of the new items were valid.”

Source: Shaffer (47)
Group tested: 1515 AAF personnel
Criterion: “Normals” vs. anxiety-reaction cases
Result: 18 out of 20 items on the schedule were shown to be valid by item analysis.

Personal Inventory (continued)

Source: Shipley & Graham (48)
Group tested: 623 Naval personnel
Criterion: “Normals” vs. psychiatric discharges
Result: “Differentiation continued at a high level ... for example, one cutting score identified 52% of the discharges while including 4% of the normal men.”

Source: Shipley & Graham (48)
Group tested: 571 Naval recruits & 491 limited service men
Criterion: Psychiatric disposition subsequently made of the men
Result: “Good differentiation was again found. One cutting score identified 52.5% of the psychiatric discharges, while including less than 7% of the normals.”

Source: Shipley, Gray & Newbert (49)
Group tested: 1385 enlisted Naval personnel
Criterion: “Normals” vs. psychiatric discharges
Result: All sixty items of the original scoring stencil continued to differentiate successfully between discharges and “normals.” Critical ratios of the mean score group-differences ranged from 2.4 to 15.9 for the 60 items.

Source: Shipley, Gray & Newbert (50)
Group tested: 538 Naval trainees and 263 psychiatric ward discharges
Criterion: “Normals” vs. psychiatric discharges
Result: “A critical score of 8 . . . identified 68.8% of the psychiatric discharges and included but 4.45% of the ‘normals.’ The validity of each item also proves satisfactory, with critical ratios ranging from 3.8 to 16.7.”

Source: Stolurow and Schrader (55)
Group tested: 136 AAF enlisted men
Criterion: Combat returnees vs. non-combat enlisted men
Result: “The ex-combat men obtained higher scores on the inventory, which was indicative of greater maladjustment within this group. The difference between the mean scores of the two groups was significant...” (t = 72.43).

Source: Stone and Malament* (56)
Group tested: 1266 Maritime Service trainees
Criterion: Non-diagnosed men vs. trainees disenrolled for neuropsychiatric reasons
Result: At a cutting score of 20, 65.4% (N = 30) of the neuropsychiatric disenrollees were detected, at a cost of 7.3% (N = 88) false positives.

* In this study the New London NDRC Inventory, a forerunner of the Personal
Inventory, was employed.

Personal Inventory (continued)

Source: Wexler (65)
Group tested: 2152 Naval enlistees entering recruit training
Criterion: “Normals” vs. referrals returned to duty and neuropsychiatric discharges
Result: With a cutting score set to refer 20% of the total group for psychiatric check, and excluding the 57 referrals who were returned to duty, there were 72% true positives, 16% false positives, and 28% false negatives; the respective N’s are 71, 319, and 28.

Source: Wexler (65)
Group tested: 340 Naval enlistees (242 normal, 98 maladjusted)
Criterion: Ratings by psychiatrists and psychologists
Result: With a cutting score set to refer 30% of the total group for psychiatric check, there were 69% true positives, 14% false positives, and 31% false negatives; the respective N’s are 68, 34, and 30.


Cornell Selectee Index

Source: Dynes (11)
Group tested: 2000 Naval personnel returned from duty
Criterion: Hospitalization for nervous breakdown on basis of psychiatric examination
Result: All the men who had to be hospitalized for nervous breakdown obtained a CSI score of 25 or more.

Source: Harris (20)
Group tested: Naval personnel applicants at a Navy Yard
Criterion: Neuropsychiatric screening
Result: “Those showing a score of 25 or more invariably fell into the category of severe psychoneurotics and could, therefore, be earmarked for careful questioning. Those showing scores of less than 15 could almost as readily be accepted for employment.”

Source: Heathers (21)
Group tested: 300 AAF enlisted men
Criterion: Anxiety-reaction and psychoneurotic patients vs. non-psychiatric patients
Result: Critical ratios of the mean score differences from 3.2 to 7.1 show that the CSI “gives a reliable differentiation of anxiety-reaction and psychoneurotic patients ... from orthopedic and medical surgical patients.”

Source: Heathers (21)
Group tested: 300 AAF personnel
Criterion: Anxiety-reaction and psychoneurotic patients vs. non-psychiatric patients
Result: A critical ratio of the mean score difference of 8.2 between non-psychiatric and anxiety-reaction patients; and of 10.9 between non-psychiatric and psychoneurotic groups.

Cornell Selectee Index (continued)

Source: Manson and Grayson (33)
Group tested: 200 Army military prisoners
Criterion: Sick-book riders vs. non-sick-book riders
Result: “The Cornell Selectee Index is an excellent instrument for identifying ‘sick book riders’ and eliminating non-‘sick book riders,’ when the critical score of 25 or more is used. The CSI does not differentiate to any marked degree between ‘sick book’ and non-‘sick book riders’ for the ‘mild psychoneurosis’ or the ‘non-psychoneurosis’ categories.”

Source: Weider and Wechsler (61)
Group tested: 1000 selectees
Criterion: Neuropsychiatric screening
Result: 89% (N = 89) of the men rejected by the neuropsychiatric screening were also screened by the Index, with 12% (N = 110) false positives.

Source: Weider and Wechsler (61)
Group tested: 204 neuropsychiatric discharges and 406 men accepted for military training
Criterion: Neuropsychiatric disability discharge vs. military training acceptance
Result: 82% (N = 173) of the discharges were also detected by the Index, with 6% (N = 27) false positives.

Source: Weider and Wechsler (61)
Group tested: 600 psychiatric accepts and 400 psychiatric selectee rejects
Criterion: Neuropsychiatric screening
Result: 71% (N = 284) of the psychiatric rejects were also screened by the Index, with 15% (N = 89) false positives, at a cutting score of 15 plus 1 stop question.

Source: Weider and Wechsler (61)
Group tested: 539 “normal” selectees; 142 psychoneurotic discharges; 260 moderately psychoneurotic men; 39 mildly psychoneurotic men
Criterion: Neuropsychiatric diagnosis
Result: .9% (N = 5) of the “normals,” 13% (N = 5) of the “mild” psychoneurotics, 78% (N = 202) of the “moderately severe” psychoneurotics, and 92% (N = 131) of the “severe” psychoneurotics were screened at a cutting score of 23.

Cornell Selectee Index (continued)

Source: Weinstock and Watson (64)
Group tested: 212 Naval training recruits
Criterion: Cornell Selectee Index diagnoses vs. psychiatric discharge
Result: Of 91 recruits with high CSI scores who were retained “on trial,” only 38 (or 23%) had to be discharged during training. 91 of the 164 recruits with high CSI scores had been passed as acceptable by clinical judgment prior to the CSI; only 8 of these 91 had to be discharged.

Source: Wexler (65)
Group tested: 2152 Naval enlistees entering recruit training
Criterion: “Normals” vs. referrals returned to duty and neuropsychiatric discharges
Result: With a cutting score set to refer 20% of the total group for psychiatric check, and excluding the 57 referrals who were returned to duty, there were 71% true positives, 17% false positives, and 29% false negatives; the respective N’s are 70, 339, and 29.

Source: Wexler (65)
Group tested: 340 Naval enlistees (242 normal, 98 maladjusted)
Criterion: Ratings by psychiatrists and psychologists
Result: With a cutting score set to refer 30% of the total group for psychiatric check, there were 67% true positives, 15% false positives, and 33% false negatives; the respective N’s are 66, 36, and 32.

Source: Wolff & others (75)
Group tested: 307 selectees
Criterion: Neuropsychiatric screening
Result: 86% of the men rejected by the neuropsychiatric screening were also screened by the Index.

Source: Wolff & others (75)
Group tested: 1863 selectees
Criterion: Neuropsychiatric screening
Result: 80% of the men rejected by the neuropsychiatric screening were also screened by the Index.

Source: Wolff & others (75)
Group tested: 1390 selectees
Criterion: Neuropsychiatric screening
Result: 87% of the men rejected by the neuropsychiatric screening were also screened by the Index.

Source: Wolff & others (75)
Group tested: 282 Army and Navy discharges
Criterion: Discharge for neuropsychiatric disorders
Result: 88 to 90% of the men discharged were screened by the Index.

Source: Wolff & others (75)
Group tested: 63 selectees who had already been screened
Criterion: Second neuropsychiatric examination
Result: 87% of the men rejected by the second neuropsychiatric screening were also rejected by the Index.

Cornell Selectee Index (continued)

Source: Wolff & others (75)
Group tested: 50 selectees who had already been interviewed
Criterion: Second neuropsychiatric examination
Result: In four cases the second interviewer reversed the decision of the first one partly as a result of the added information afforded by the Index.

Source: Wolff & others (75)
Group tested: 380 Southern white and Negro selectees
Criterion: Neuropsychiatric screening
Result: The Index screened 30 out of 31 men rejected by the neuropsychiatric interview method.


Maritime Service Inventory

Source: Coville (10)
Group tested: 678 Maritime Service trainees
Criterion: “Normals” vs. neuropsychiatric subjects
Result: With a cutting score of 20, 66% of the neuropsychiatric subjects were identified, at a cost of 2 false positives. When “stop” items were included, 93% true positives were detected to 2% false positives. Critical ratio of the mean score difference, 21.5.

Source: Stone & Malament (56)
Group tested: 1208 Maritime Service trainees
Criterion: Non-diagnosed men vs. trainees disenrolled for neuropsychiatric reasons
Result: At a cutting score of 21, 66% (N = 68) of the neuropsychiatric disenrollees were detected, at a cost of 13% (N = 144) false positives.

Source: Stone & Malament (56)
Group tested: 1018 Maritime Service trainees
Criterion: Non-diagnosed men vs. trainees disenrolled for neuropsychiatric reasons
Result: At a cutting score of 15, 60% (N = 34) of the neuropsychiatric disenrollees were detected, at a cost of 12% (N = 111) false positives.

Source: Varney & Stone (59)
Group tested: 813 Maritime Service trainees disenrolled for neuropsychiatric reasons
Criterion: Disenrollment for neuropsychiatric reasons
Result: Of 813 disenrollees, 566 (70%) were detected by procedures in which the MSI played a prominent role; 247 (30%) were disenrolled after having passed through the screening procedures unrejected.

Minnesota Multiphasic Inventory

Source: Benton (4)
Group tested: 85 Naval neuropsychiatric patients
Criterion: Neuropsychiatric ward status
Result: Five out of 10 schizophrenics gave positive results on the Schizophrenia Scale; 5 out of 9 hysterics were positive on the Hysteria Scale; 13 of 16 delinquents were positive on the Psychopathic Deviate Scale.

Source: Benton & Probst (5)
Group tested: 76 neuropsychiatric Naval patients
Criterion: Ratings by psychiatrists
Result: “In the case of the Psychopathic Deviate, Paranoia, and Schizophrenia trends the differences with respect to mean test score between the normal and the abnormal groups can be considered to be significant (CR’s 2.6 to 3.2). ... On the other hand, there is no substantial amount of agreement with respect to the strength of the Hypochondriasis, Depression, Hysteria, Femininity, and Psychasthenia trends.”

Source: Gough (17)
Group tested: 166 enlisted men
Criterion: “Normals” vs. psychiatric hospital admissions
Result: Significant critical ratios showed that the MMPI successfully differentiated among several diagnostic subgroupings.

Source: Leverenz (31)
Group tested: 105 psychiatric ward Army cases
Criterion: Neuropsychiatric diagnosis
Result: “In the majority of the cases the scores on the Inventory did confirm the clinical impression.” (No statistical data given.)

Source: Modlin (36)
Group tested: 316 enlisted Army personnel
Criterion: Neuropsychiatric ward cases vs. “normals”
Result: “Depression was most successfully verified by the Multiphasic Inventory, inasmuch as 88% of 31 clearly classified depressives scored highest on the D Scale ... A close correlation with clinical expectations is seen in most of the categories.”

Minnesota Multiphasic Inventory (continued)

Source: Morris (37)
Group tested: 320 Naval personnel under psychiatric observation
Criterion: Neuropsychiatric diagnosis
Result: “The nosological groups under consideration could not be differentiated from one another on the basis of Inventory scores ... The Inventory does differentiate borderline normals from serious pathological states but does not aid in the differential diagnosis among the pathological groups.”

Source: Schmidt (46)
Group tested: 211 AAF personnel
Criterion: “Normals” vs. men diagnosed as neuropsychiatric cases
Result: Critical ratios of the mean score group-differences showed significant inventory differences between “normals” and men diagnosed as psychopaths, neurotics, and psychotics.


Experience Comparison Index

Source: Owens (39); and Owens and Zirkle (40)
Group tested: 400 enlisted Naval personnel
Criterion: “Normals” vs. men diagnosed psychiatrically as maladjusted
Result: “False positive scores not only ranged as high as 30, but they occurred more frequently than the true positive scores.” 187 positive to 217 false positives out of 400 tested men.

Source: Wexler (65)
Group tested: 582 Naval enlistees (209 well-adjusted, 241 doubtful, 132 poorly adjusted)
Criterion: Ratings by psychiatrists and psychologists
Result: With a cutting score set to refer 40% of the total group for psychiatric check, and excluding the 241 doubtful cases, there were 71% true positives, 13% false positives, and 29% false negatives; the respective N’s are 94, 27, and 38.


Personal Check List

Source: Owens (39)
Group tested: 600 enlisted Naval personnel
Criterion: “Normals” vs. men psychiatrically diagnosed as maladjusted
Result: The Personal Check List referred 417 positives to 183 false positives.

Source: Wexler (65)
Group tested: 561 Naval enlistees (418 normal, 143 maladjusted)
Criterion: Ratings by psychiatrists and psychologists
Result: With a cutting score set to refer 30% of the total group for psychiatric check, there were 75% true positives, 15% false positives, and 25% false negatives; the respective N’s are 107, 63, and 36.

Billet Qualifications Blank

Source: Wexler (65)
Group tested: 2152 Naval enlistees entering recruit training
Criterion: “Normals” vs. referrals returned to duty and neuropsychiatric discharges
Result: With a cutting score set to refer 20% of the total group for psychiatric check, and excluding the 57 referrals who were returned to duty, there were 69% true positives, 17% false positives, and 31% false negatives; the respective N’s are 68, 339, and 31.

Source: Wexler (65)
Group tested: 268 Naval enlistees (104 well adjusted, 112 doubtful, 52 poorly adjusted)
Criterion: Ratings by psychiatrists and psychologists
Result: With a cutting score set to refer 40% of the total group for psychiatric check, and excluding the 112 doubtful cases, there were 78% true positives, 13% false positives, and 22% false negatives; the respective N’s are 41, 14, and 11.


Convalescent Personal Inventory

Source: Heathers (21)
Group tested: 235 AAF enlisted personnel
Criterion: Anxiety-reaction patients vs. non-anxiety-reaction patients
Result: A critical ratio of 9.2 “indicates a high degree of differentiation.”

Source: Heathers (21)
Group tested: 441 AAF enlisted personnel
Criterion: Psychiatric vs. non-psychiatric patients
Result: A biserial correlation of .38 was found between the inventory scores and the criterion. “The psychiatric group was found to have a reliably greater number of significant responses.”


Neuropsychiatric Adjunct Inventory

Source: H. C. Leavitt (26)
Group tested: 768 military inductees
Criterion: Ratings by psychiatrists
Result: About 85% of the men presenting the more common psychopathological syndromes were successfully screened by the inventory.


Questionnaire Regarding Present Reactions

Source: Heathers (21)
Group tested: 200 AAF enlisted personnel
Criterion: Psychiatric vs. non-psychiatric patients
Result: A critical ratio of the mean score difference of 6.2 was found; and a biserial correlation of .67. “A critical score of 40 significant responses screened 69% of patients diagnosed as psychiatric and 21% of patients diagnosed as non-psychiatric.”

Inventory of Psychological Problems

Source: Heathers (21)
Group tested: 426 AAF enlisted personnel
Criterion: Psychiatric vs. non-psychiatric patients
Result: Critical ratios of the mean score differences between psychiatric and non-psychiatric groups ranged from 1.4 to 7.6. “The total scores on frequency and severity items of the inventory discriminated psychiatric and non-psychiatric groups of patients.”


Officer Personal Inventory

Source: Smith & Voss (53)
Group tested: 1383 officers
Criterion: Qualified officers vs. emotionally disqualified officers
Result: “A cut-off score of 30 on this key marked off 53.01% of the emotionally disqualified officers and 2.62% of qualified officers.” Significant critical ratios were obtained for all the items of the test.


Psychoneurotic Inventory

Source: Page (41)
Group tested: 244 Army trainees
Criterion: Number of times the men went on sick call
Result: Coefficient of correlation between the number of times the men went on sick call and their Inventory scores was .25.

Source: Page (41)
Group tested: 600 enlisted Army personnel
Criterion: Neurotics vs. undiagnosed men
Result: The critical ratio of the mean score difference between groups was 15.7; the biserial correlation coefficient was .78.


Bernreuter Personality Inventory

Source: Page (41)
Group tested: 200 enlisted Army personnel
Criterion: Neurotics vs. undiagnosed men
Result: The critical ratio of the mean score difference between groups was 12.27; the biserial correlation coefficient was .75.


Psychosomatic Inventory

Source: Page (41)
Group tested: 200 enlisted Army personnel
Criterion: Neurotics vs. undiagnosed men
Result: The critical ratio of the mean score difference between groups was 8.8; the biserial correlation coefficient was .80.

The remarks of Smith and Voss, reporting on their validation of the
Officer Personal Inventory, are of interest in this connection: 


The second restrictive condition was the impossibility of establishing
complete independence between the Officer Inventory scores and the diagnoses of
emotional fitness, at the time the diagnoses were made. In the established
classification procedure, which it was not possible to alter, the Boards of
Medical Examiners had available at the time of their examinations, total Officer
Personal Inventory scores, based on a tentative 34-item scoring key, for over
half of the officers concerned in this study. Hence, there may have been some
suggestive effect exercised by the knowledge of the Inventory scores upon the
diagnoses of emotional fitness . . . (53, pp. 1-2).


In other words: even when military experimenters were well aware 
of the necessity of avoiding criterion contamination, military conditions 
sometimes made it impossible for them to eliminate all such contamina- 
tion. Consequently, several of the obtained validity coefficients are 
doubtless higher than they should be. The exact extent of validity-in- 
flation, however, is extremely difficult to evaluate. 
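
The direction of this inflation, though not its size in any actual study, can be illustrated with a small simulation. The sketch below is purely hypothetical: the noise levels, the `leak` weight (how strongly the rater leans on the known inventory score), and the function names are all assumptions chosen for illustration and are not estimates drawn from the studies reviewed.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulated_validity(leak, n=5000, seed=1948):
    """Correlate inventory scores with ratings that lean on those scores.

    leak = 0.0 models a rater who never sees the inventory score;
    leak > 0.0 models a rater who lets the known score sway the rating.
    """
    rng = random.Random(seed)
    scores, ratings = [], []
    for _ in range(n):
        maladjustment = rng.gauss(0.0, 1.0)             # unobserved true state
        score = maladjustment + rng.gauss(0.0, 1.5)     # fallible inventory
        judgment = maladjustment + rng.gauss(0.0, 1.5)  # fallible clinician
        scores.append(score)
        ratings.append(judgment + leak * score)         # contaminated rating
    return pearson_r(scores, ratings)


independent = simulated_validity(leak=0.0)
contaminated = simulated_validity(leak=0.5)
print(f"independent criterion:  r = {independent:.2f}")
print(f"contaminated criterion: r = {contaminated:.2f}")
```

Because both runs use the same seed, the two coefficients differ only through the assumed leak, so the gap between them reflects contamination alone.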

2. Criterion overlap. Aside from the possibility of criterion contami- 
nation, overlap between criterion and inventory seems to have occurred 
in some of the validity investigations. Thus, even when the psychiatrists’ 
judgments of the respondents were made wholly independently of these 
respondents’ inventory answers, the questions asked by the psychia- 
trists in several instances duplicated those included in the inventories 
(or vice versa). This means that the prognosis of the psychiatrists and 
the inventories might be in close agreement, without either one’s being 
necessarily accurate in terms of the actual outcome of the prognosti- 
cated cases. The obtained validity coefficients might consequently be 
spuriously high. 

Overlap between the criterion and the inventory under investigation 
may be minimized by employing an actual outcome rather than a mere 
prognosis. Thus, the criterion may be actual success or failure to adjust 
to military life, or it may be neuropsychiatric discharge because of break- 
down in military service. But in many of the reported studies the cri- 
teria could not be of this definitive nature. Often the criteria were 
merely prognoses based on rather brief psychiatric interviews. In these 
latter instances, criterion overlap was common, and is judged to have 
caused validity coefficients which must be taken with at least “a grain 
of salt.”

3. Sample heterogeneity. The unscreened recruits tested by the armed 
forces were frequently extremely heterogeneous, including at the lower 
end unemployables, tramps, loafers, ‘‘bums,’’ alcoholics, frank neurotics 
and so on (9). It should be relatively easy (by almost any technique) 
to sort out such individuals. The samples in civilian studies of personal- 
ity inventories, on the other hand, have usually included relatively few 
of such easily diagnosable cases. This probably accounts, at least in 
part, for the higher degree of success in the military reports on personal- 
ity inventories. 

4. Extreme or atypical validation groups. In several of the military 
studies of personality inventories, the groups that were employed to test 
the validity of the inventory were of a somewhat special or extreme sort. 
Of course the “method of extreme groups” is of recognized value in the 
development of a test; but for a check on practical, operating validity, 
application of the test instrument to an unselected sample appears pref- 
erable. If, for example, a sample contains few or no “doubtful” cases, 
it is free of the very cases which are hardest to diagnose. Similarly, if a 
sample contains an unusually large proportion of maladjusted cases, 
the proportion of “false positives” to true positives at any given cut-off 
score will necessarily be lower than in a sample containing only a normal 
complement of maladjusted. (If the sample contained only maladjusted 
cases, there could not be any false positives.) Such facts have not always 
received full recognition in validity-studies of screening instruments. 
Some studies in which samples of abnormal composition were used, with 
inferences as to validity, are cited below. 


In Page’s (41) and Heathers’ (21) studies, groups were used in which the 
number of abnormal cases equalled the number of normal—clearly an atypical 
ratio. 

In some of Wexler’s (65) tables (though not in his graphs), the percentage of 
false positives has been figured on a group from whom “doubtful” cases have 
been excluded. (In one sample of 582 men, the “doubtful” cases constituted no 
less than 41% of the total.) 

Smith and Voss (53) found that the Officer Personal Inventory distinguished 
significantly between 1300 qualified officers and 83 emotionally disqualified 
ones. But the qualified men had already been passed by a medical examiners’ 
board, and were therefore (so to speak) “super-normal” rather than “normal.” 
Thus the sample presumably included relatively few of the borderline group— 
differentiation of whom is ordinarily the most difficult task of an inventory. 

The “normal” group used in Schmidt’s (46) study obtained Minnesota 
Multiphasic Inventory mean scores of below 50 on eight of the nine MMPI 
scales—thus indicating that it, too, was probably super-normal. 

Shipley, Gray, and Newbert (49), in a study involving the Personal Inven- 
tory, reported that “the normal group comprised 1004 newly enlisted men who 
had been favorably passed upon in the psychiatric interview; the deviating 
group, 385 early psychiatric discharges tested while under observation, and 
prior to discharge, on the psychiatric ward” (49, p. 2). Here again a screened 
or “super-normal” group is being compared with a definitely abnormal one. 
Moreover, the ratio of 385 “early psychiatric discharges” to 1004 is clearly 
atypical: a more normal ratio would appear to be only 50 to 1000, or perhaps 
as high as 100 to 1000 (65, p. 126). 


Regarding military studies which make use of hospitalized patients, 
Wexler makes the following pertinent observation: 


Nothing is so apt to produce delusions of grandeur in the constructor of 
personality inventories than the use of what might be termed the “ex post facto” 
psychiatric criterion. This involves a technique by which a hospitalized popula- 
tion is contrasted with successful military personnel in a training center. The 
difficulty with this procedure is that patients in a neuropsychiatric ward have 
generally become so sensitized to clinical symptomatology that they no longer 
react in the way in which they would have reacted to the same instrument prior 
to hospitalization. Furthermore, the hospital wards generally contain a selected 
sample from the extreme of the deviated cases. This means an extremely 
stringent criterion group.... It is no remarkable thing, then, that test score 
differentiation from a young, enthusiastic, and healthy recruit group can be 
obtained with considerable ease (65, pp. 140-141). 


5. Differential motivation. It seems likely that in some of the armed 
forces’ validation studies the ‘‘abnormal’’ criterion groups were differ- 
ently motivated than the ‘‘normal’’ groups; and that consequently some 
of the obtained significant differences were exaggerated. Relevant here 
are the following considerations. 

a. A good many of the neuropsychiatric groups that were used as 
criterion groups consisted exclusively of neuropsychiatric ward cases. 
It is possible that these ward cases, who were only a step or two from 
transfer to inactive service or from complete release, had a definite in- 
centive to answer the inventory questions in a self-incriminating man- 
ner. This would clearly differentiate them from non-hospitalized ‘‘nor- 
mals,’’ and spuriously boost the obtained validity coefficients. 

b. In the study by Manson and Grayson (33) employing the Cornell 
Selectee Index, it was found that this Index differentiated significantly 
between sick-book riders and non-sick-book riders. It is stated by the 
authors that the sick-book riders were not discouraged from obtaining 
temporary escape from the rigorous training program of the Center 
where they were being held as prisoners. Thus they were allowed to 
achieve a definite “neurotic gain” from their sick-book riding. But since 
the CSI consists largely of psychosomatic questions; and since the sick- 
book riders (who were tested after they became sick-book riders) were 
already on record as having various psychosomatic complaints—it 
would be odd if they did other than check considerably more CSI re- 
sponses than did non-sick-book riders. 

c. Altus and Bell, after reporting significant inventory score-differ- 
ences between successful and failing illiterate Army Special Training 
Center inductees, are careful to note that “one caveat, however, must 
be entered. In the ordinary school situation, passing a test or success- 
fully completing a course has value as a goal toward which to strive. 
For some, if not many, of the trainees, passing the tests for graduation 
may not represent a desirable goal; for it means retention in the Army, 
an outcome which occasionally finds some quite intelligent soldiers un- 
enthusiastic. Becoming a soldier meant to the average trainee giving up 
a job which paid him two to three times more than he ever earned in his 
life before” (2, p. 103). In other words: some of the trainees (whether 
consciously or unconsciously) had a definite motive (a) for failing the 
inventory—that is, appearing to be neurotic; and (b) for failing the 
course. The observed significant inventory-differences, therefore, be- 
tween those who passed and those who failed, are probably partly spuri- 
ous. 

d. In separate studies by Gough (17) and Schmidt (46) it was found 
that such diagnosed groups of military personnel as psychopaths, neu- 
rotics, and psychotics obtained Minnesota Multiphasic Personality In- 
ventory scores that significantly differentiated them from the “normal” 
controls on nearly all the MMPI scales. This would seem to indicate 
that either the MMPI scales do not actually differ from each other as 
they are supposed to; or that the respondents in the psychiatric groups 
were motivated deliberately to answer all kinds of MMPI questions in 
a manner unfavorable to themselves. Curiously enough, in the Gough 
study the Lie scores of the psychiatric respondents were always in the 
normal range, but the ? scores of the psychiatric subjects were almost 
always considerably below those of the “normals.” It may therefore be 
wondered if the psychiatric respondents were able to avoid the Lie ques- 
tions (which are fairly transparent) but were so eager to commit them- 
selves to unfavorable answers to the other items that they gave sus- 
piciously few ? responses. 

e. In one of the investigations reported by Shipley and Graham (48), 
it was found that, for some unexplained reason, 2301 Amphibious Forces 
men, who had already been psychiatrically screened and designated as 
“normal” cases, obtained a mean Personality Inventory score of 15.2— 
which was definitely worse than the mean score of 14.6 obtained by 74 
psychiatrically disapproved Submarine School personnel. Such a re- 
versal, when obtained with an inventory that was generally successful 
in differentiating between psychiatric cases and ‘‘normals,’’ suggests 
first that the psychiatric standards applied to the submarine men were 
higher than those in the Amphibious Forces. Another possibility is that 
the Amphibious Forces men were indifferent to whether or not they 
made favorable scores on the Inventory, while the submarine men tried 
to make as good a showing as possible. In this connection, a comment 
by Wexler is pertinent: 


Men applying for submarine duty are highly motivated. As a volunteer 
group, they are extremely eager to make a good impression and qualify for sub- 
marine training. Raw recruits, too, generally have reasonably high motivation 
since all the indications are that men passing through the initial training stages 
are generally eager to do well in the service, though the motivation is hardly as 
high as in the case of the submarine group. There is reason to believe that the 
amphibious forces were somewhat less motivated. Certainly the motivation of 
the combat-experienced groups, at least as far as the tests were concerned, was 
less than for any of the other groups. Simple observation has confirmed the fact 
that these men were indifferent to the tests and cynical concerning anything 
except the chance for immediate leave (65, p. 168). 


In sum, it seems clear that differences in group-motivation can cause 
significant differences in personality inventory scores. This factor re- 
quires attention in evaluating observed group-differences. 

6. Honesty of response. The preceding section has cited instances 
where responses in certain groups were probably distorted either in the 
direction of self-incrimination (e.g., neuropsychiatric ward cases) or in 
the direction of self-inflation (submarine men). There is, however, the 
possibility that, on the average, the members of the armed services 
answered the personality inventories with less distortion than civilians 
ordinarily do; and that, in consequence, the inventories tended to be 
more effective in military than in civilian practice. Relevant in this 
connection are the following points. 

a. Military personnel may be specifically encouraged and warned to 
give honest inventory responses by test administration techniques that 
are impossible with civilians. Thus, in Coville’s study (10) the subjects 
were warned that dishonest answers would inevitably be discovered later 
and would then lead to a dishonorable discharge. They were also cau- 
tioned that, for their own medical well-being, they should be particu- 
larly honest about admitting physical and nervous symptoms. 


b. Harris, using the Cornell Selectee Index with Naval personnel, 
makes the following report: 


Rarely do patients fail to answer truthfully the question, “Were you ever a 
patient at a mental hospital?” On the contrary, they have usually been so 
anxious not to be caught in written untruths, that they are likely to answer 
“Yes” to that question, whereas, actually, the confinement was to a general 
hospital for injury or nonpsychiatric observation. Likewise, the question, 
“Have you ever had a fit or a convulsion?” is also answered with meticulous 
truthfulness, thereby leading to a diagnosis of epilepsy or hysterical loss of 
consciousness, equally undesirable in Yard employees or Navy personnel. 
Often, in his anxiety to answer the question correctly, the applicant is led to 
include such occurrences as “fits of temper” or simple syncope, also important 
in his evaluation (20, p. 596). 


c. Wolff and his associates, after considerable experience with the 
Cornell Selectee Index, note that “although responses may be falsified, 
in practice we have noted this infrequently” (75, p. 9). Civilian studies, 
on the other hand, show quite frequent and consistent falsification or 
exaggeration of personality inventory scores (12, pp. 414-420). 

d. A discordant note is struck by Cerf⁸ who, employing the Informa- 
tion Blank, S-C, CE 410 A with Army Air Force personnel, discovered 
that “the truthfulness scores of the group were generally low. Of the 
ten truthfulness items, the average aviation student made truthful re- 
sponses to only 5.4.” This indicates “a substantial amount of falsifica- 
tion of response” (8, p. 580). It is perhaps significant that in this par- 
ticular study the inventory employed did not successfully distinguish 
between passing and failing aviation students. 


8 For a summary of Cerf’s study, see the 22nd entry in Table II. 

7. Role of intelligence. In some of the military studies of personality 
inventories, the poor showing of some respondents may have been due 
to their failure fully to comprehend either the instructions of the inven- 
tory, the content, or both. If (as seems rather likely) such respondents 
tended also to be relatively unadaptable to military life, the effect would 
be to boost the military validity of the personality inventories. In this 
connection, Shipley, Gray, and Newbert noted a ‘‘tendency . . . for men 
with low General Classification Test scores to show up as maladjusted 
on the Personal Inventory” (51, p. 7). This observation is based on a 
correlation of —.28 between AGCT scores and Personal Inventory scores 
of the testees. Such a correlation is too low to give much support to the 
idea of a common intelligence-factor in inventory-scores and adaptabil- 
ity to military life; but it is quite possible that the relation is curvilinear, 
and stronger in the lower reaches of intelligence than in the levels above. 

Altus and Bell, administering their inventory orally to Army illiter- 
ates, found that the inventory scores indicated “. . . considerably 
more hysteria, hypochondria, paranoia, and depression among these 
men of low socio-economic status and of marginal intellect than is true 
of groups approximating normal intelligence and socio-economic level. 
For these latter groups, more subtle questions would doubtless be re- 
quired” (2, p. 476). If Altus and Bell are right in their remark about 
the need for more subtle questions, it appears that the comparatively 
direct questions of personality inventories probably have special valid- 
ity when used with less intelligent or more naive individuals. This may 
well be one of the reasons for the superior validity of the inventories in 
military applications; since civilian applications of the inventories have 
commonly been to groups that are more intelligent, better educated, and 
more sophisticated than the typical (enlisted) military sample. 

8. Statistical inadequacies. In some validations of personality inven- 
tories in military practice, there were statistical inadequacies which 
throw doubt upon the accuracy of the obtained significant differences or 
validity coefficients. Some of these statistical inadequacies will now be 
discussed. 

a. In several of the studies involving biserial coefficients of correla- 
tion, the coefficients were calculated on the basis of an almost equal 
number of ‘‘normals’’ and abnormals. Actually, since ‘‘normals’’ are 
generally far more numerous than abnormals, biserial correlation co- 
efficients should be calculated only on groupings which approximate the 
“normal”-abnormal ratio. Yet in Page’s (41) study several of the bi- 
serial correlations were based on 100 neurotics and 100 undiagnosed 
cases; and in several of Heathers’ (21) validation experiments, biserial 
correlations were based on 145 psychiatric and 158 non-psychiatric pa- 
tients, on 100 psychiatric and 100 non-psychiatric patients, and on 215 
psychiatric and 226 non-psychiatric cases. Biserial coefficients computed 
for such samples are spuriously high. 
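
The dependence of a two-group correlation on the proportions in each group can be illustrated with a minimal sketch. It uses the point-biserial coefficient, not the biserial coefficient reported in the studies; the biserial adds a correction based on an assumed normal distribution of the dichotomized trait, so the figures below are not comparable with those cited above. The separation of one standard deviation between the group means and the three splits are hypothetical values chosen for illustration, not data from any study reviewed here.

```python
import math

def point_biserial(d, p):
    """Point-biserial r for two groups whose means differ by d within-group
    standard deviations, when a proportion p of the combined sample
    belongs to the abnormal group (population formula, equal group SDs)."""
    pq = p * (1.0 - p)
    return d * math.sqrt(pq) / math.sqrt(1.0 + d * d * pq)

D = 1.0                      # hypothetical separation between group means
SPLITS = [0.05, 0.10, 0.50]  # hypothetical proportions of abnormal cases

# The same separation yields a larger coefficient as the split nears 50:50.
table = {p: round(point_biserial(D, p), 2) for p in SPLITS}
for p, r in table.items():
    print(f"proportion abnormal {p:.2f}: r = {r:.2f}")
```

Under these assumptions the coefficient is at its largest for an even split and shrinks as the abnormal group becomes a small minority, which is the direction of the bias discussed above.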

b. In several of the studies results are reported only in terms of the 
critical ratio of the mean score difference between psychiatric and non- 
psychiatric groups, rather than in terms of the correlation coefficient. 
Critical ratios, however, can be misleading when used for test validation, 
as shown by the instance where Williams, Leavitt, and Mendola (73) 
found a critical ratio of 3.6 for the difference between the mean scores of 
passing and failing Marine Corps Officer candidates who took the Per- 
sonal Inventory. When the biserial correlation was calculated for the 
same groups, the correlation coefficient was found to be only .18. Care- 
ful analysis led the authors to conclude that “it is clear that the degree 
of overlapping of the pass and fail distributions is so great as virtually 
to preclude use of the stencil in individual selection. Only at the very 
extreme of the distribution would the test score be very significantly 
better than chance as a predictor of success or failure in Platoon Com- 
manders School” (73, p. 7). 
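
Why a sizable critical ratio can coexist with a low correlation may be seen from a short sketch. For a fixed separation between group means, the critical ratio grows with the square root of the sample size, whereas the overlap between the two score distributions does not change at all. The mean difference of 0.3 standard deviations and the group sizes below are hypothetical; they are not the figures of the Marine Corps study, whose sample sizes are not given here.

```python
import math

def critical_ratio(d, n_pass, n_fail):
    """Critical ratio (mean difference divided by its standard error) for a
    mean difference of d standard deviations, assuming the same SD in both
    groups and independent samples."""
    standard_error = math.sqrt(1.0 / n_pass + 1.0 / n_fail)
    return d / standard_error

D = 0.3  # hypothetical separation, in SD units; heavy overlap of distributions

# Same separation, larger samples: the critical ratio rises steadily.
for n in (50, 200, 800):
    print(f"n = {n} per group: critical ratio = {critical_ratio(D, n, n):.1f}")
```

With these assumed values the critical ratio passes the conventional level of 3 once each group reaches 200 cases, although the separation of the groups, and hence the usefulness of the score for individual selection, is the same in every row.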

c. When correlation coefficients of validity were calculated, some of 
the authors of the reported studies read considerable significance into 
coefficients which were hardly very high. Thus, Satter, after obtaining 
a validity coefficient of .39 when using the Personal Inventory on para- 
chute trainees, remarked that “in terms of what we know about the pre- 
dictive efficiency of personality measures in general, this coefficient is 
gratifyingly high” (44, p. 30). While it is true that a correlation of .39 
is comparatively high for personality inventory validation, it still leaves 
much to be desired. 

d. Several of the experimenters unfortunately employed the same 
group for validation purposes that they had originally employed for the 
standardization of their instruments. This procedure will almost invari- 
ably result in spuriously high validity coefficients or critical ratios (par- 
ticularly if the original sample is not very large). Smith and Voss (53), 
Page (41), Williams and Leavitt (68), and Heathers (21) all seem to 
have at times employed this particular technique. As Williams and 
Leavitt point out, this does not imply that the investigators were un- 
aware of the problem; but that the exigencies of military practice some- 
times precluded their using more satisfactory validation procedures. 
Nevertheless, the fact remains that in those instances where validation 
was accomplished without a fresh sample, the observed critical ratios or 
validity coefficients are generally too high. 

e. In some of the studies, very small numbers of cases were employed 
in the diagnostic subgroups. Thus, in Gough’s (17) and in Schmidt’s 
(46) studies the number of individuals in the separate diagnostic cate- 
gories is frequently 12 or less; and in Bobbitt and Newman’s (7) study 
of 99 U. S. Coast Guard trainees, the number of individuals in the sepa- 
rate categories of medical treatments is often less than 10. The present 
authors doubt the dependability of statistics (including t) based on such 
small samples. It is particularly in connection with small samples that 
the question of the differential frequency of report of positive vs. nega- 
tive findings becomes important. 

In sum: largely because of the exigencies of the military situations 
in which the personality inventory validations were made, inadequate 
statistical procedures were sometimes employed. In consequence, some 
of the obtained validity coefficients or critical ratios are questionable, 
and others are definitely too high. 

9. Evaluation of “false positive” results. Because of the limited pur- 
poses to which personality questionnaires were usually applied in mili- 
tary practice, and because of the fairly abundant manpower available 
for military activity in World War II, results were sometimes accepted 
as satisfactory which by civilian standards might be regarded as only 
questionably satisfactory. Thus, large numbers of false positives were 
apparently accepted without very great concern, on the assumption 
either that the error of classification would be corrected in later check- 
ups (usually by psychiatric interview); or that a fairly large amount of 
this type of error may be tolerated in military practice (so long as addi- 
tional recruits can be readily obtained through the draft). A false posi- 
tive diagnosis in civilian practice, on the other hand, is generally taken 
more seriously. 

It requires note, moreover, that in the military studies false positives 
were usually presented in terms of percentages rather than numbers of 
cases; and this tends to make the false positives seem less prominent 
than they actually are. In actual practice, using stipulated cutting 
scores, it appears that roughly from 20 to 30% false positives were found 
among unselected recruits, while 70 to 80% of the true positives might 
be detected by the inventories. But since the “normal” testees usually 
were far greater in number than the “maladjusted” ones, 20% false posi- 
tives may easily represent a greater number of men than may 80% true 
positives. Thus, in a study by Wexler (65) of Naval recruits entering 
training, 319 false positives were screened, to only 71 true positives; in 
a study by Stone and Malament (56), 248 false positives to 79 true posi- 
tives; in a study by Satter (44), 211 false positives to 104 true positives; 
and in a study by Williams, Leavitt, and Mendola (73), 154 false posi- 
tives to 56 true positives. This means that even where, in terms of per- 
centage, the false-positive ratio is considerably smaller than the true- 
positive ratio, in terms of number of men, the group falsely classified as 
positive by the inventory may be considerably larger than the group 
correctly classified. This seems to be generally true, except in samples 
containing unusually large proportions of abnormal cases. Both the 
percentage and number of false positives could be reduced by raising the 
cutting score on the inventory. But this would increase the number of 
false negatives (see below). 
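
The arithmetic behind this point can be made concrete with a minimal sketch. It applies the approximate rates mentioned above (80% of the maladjusted detected, 20% of the normals falsely flagged) to a hypothetical intake of 1000 men. The intake size and the 50% base rate are illustrative assumptions; the 5% and 10% base rates correspond to the ratios of 50 and 100 maladjusted per 1000 cited earlier from Wexler.

```python
def screening_counts(n_total, base_rate, hit_rate, false_positive_rate):
    """Return (true positives, false positives) as head counts for a sample
    of n_total men of whom a proportion base_rate are maladjusted."""
    n_maladjusted = round(n_total * base_rate)
    n_normal = n_total - n_maladjusted
    true_pos = round(n_maladjusted * hit_rate)
    false_pos = round(n_normal * false_positive_rate)
    return true_pos, false_pos

# Hypothetical intake of 1000 men at three base rates of maladjustment.
for base_rate in (0.05, 0.10, 0.50):
    tp, fp = screening_counts(1000, base_rate,
                              hit_rate=0.80, false_positive_rate=0.20)
    print(f"base rate {base_rate:.0%}: {tp} true positives, {fp} false positives")
```

At a 5% base rate the sketch gives 40 true positives against 190 false positives, and at 10% it gives 80 against 180; only in the artificially balanced sample (50%) do the true positives outnumber the false ones, 400 to 100.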

10. Neglect of “false negative” results. False negatives on personality 
inventories represent those individuals who are not detected by their 
inventory scores, but who later prove to be definitely maladjusted. Un- 
like false positives (individuals who obtain unfavorable inventory scores 
but who subsequently prove to be “normal”), false negatives are excep- 
tionally disadvantageous in a military situation. For it is the false nega- 
tives who become inefficient soldiers (whether enlisted men or officers), 
who cause considerable trouble to others, who may jeopardize the life 
or success of a combat unit, who eventually have to be discharged for 
neuropsychiatric reasons, and who tend to become governmental 
charges, in one form or another, for decades after a given war has ended. 


In most of the reported validity-studies of personality inventories, the num- 
bers and percentages of false negatives are not recorded. This arises from the 
fact that most of these studies employed only psychiatric screening (or prog- 
nostic) criteria; and when such criteria were employed, it was customary to 
submit to psychiatric interview only those diagnosed as “positive” by the inven- 
tory. The “negatives” were not checked, and thus the false negatives or 
“misses” not detected.⁹ In studies such as those reported by Varney and Stone 
(59), Wexler (65), or Satter (44), where the criterion was that of successful 
performance or adaptability to military service, the number and percentage of 
false negatives are more likely to be either directly reported, or readily ascer- 
tainable from the data presented. 


The ratio of false negatives to true positives, where reported, seems 
to have been far from negligible. Theoretically, the ratio might be ex- 
pected to be high (a) when the cutting score on the inventory is set high¹⁰ 
(reducing the number of false positives at the cost of increasing the false 
negatives), and (b) when the motivation for “covering up” unfavorable 
responses is strong (as seems to have been the case, for example, with 
individuals seeking admission to the Merchant Marine, the submarine 
service, and officer candidate school). With a high cut-off point, Shipley 
and Graham—who have made the most extensive studies in this field— 
estimate that “by using the short form of the [Personal] Inventory, 50 
percent of the potentially dischargeable men would be identified by in- 
terviewing 8 percent of the total population” (48, p. 5); the remaining 
50 percent of the potentially dischargeable men constitute the false 
negatives. While a reduction of the interview-sample to but 8 percent 
of the total population undoubtedly represents a precious saving of in- 
terview-time in military selection, it may be doubted whether an instru- 
ment with so high a false-negative rate would, in civilian practice, be 
considered of great service. Various other investigators, using lower cut- 
ting scores, have reported lower (but still very appreciable) percentages 
of false negatives. Thus, from data presented by Satter (44, p. 11), it 
appears that at a cutting score set to eliminate 33% of the sample, 104 of 
183 recruits who failed in Parachute School (for such reasons as refusal 
to jump, and indifference to continuing training) would be detected, 
while 79 would not be; here the proportion of false negatives is 43%. 
From figures presented by Wexler (65) for a large sample of Naval re- 
cruits entering training, it is clear that, with a cutting score rejecting 
20%, the proportion of false negatives is 28% for the Personal Inven- 
tory, and 29% for the Cornell Selectee Index. Varney and Stone, using 
an unspecified cutting score on the Maritime Service Inventory, to- 
gether with a check-list of illnesses and some follow-up observation of 
selected cases, report that “about one-quarter of those disenrolled for 
neurotic and psychopathic conditions were eventually referred for at- 
tention after having passed through screening undetected” (59, p. 46); 
that is to say, in this instance the false-negative rate was about 25%. 
Weider and his associates (35, 61, 62, 63), using the Cornell Selectee In- 
dex with a presumably low cutting score, report a false-negative rate of 
only 10-20%; this rate, while gratifyingly low, necessarily entails a com- 
pensatingly high false-positive rate. A similar remark applies to H. C. 
Leavitt’s (26) findings with the Neuropsychiatric Screening Adjunct 
Inventory. 


9 The neglect of the inventory-“negatives” was due to the extreme shortage of psy- 
chiatrists, who were kept fully occupied examining the neuropsychiatrically more likely 
group of inventory-“positives.” 

10 A “high” score on inventories of the kind considered in this paper is one that de- 
notes a comparatively high likelihood of neuropsychiatric unfitness for military service; 
the higher the score, the greater the presumed neuropsychiatric vulnerability. 

The personality inventories used by the armed forces also turned up 
false negatives in another sense; that is, by proving quite ineffective in 
the diagnosis of certain psychiatric syndromes while they more effec- 
tively diagnosed others. Thus, Varney and Stone state that the Mari- 
time Service Inventory failed to screen “most psychotics [and] a large 
proportion of organic cases . . .” (59, p. 46). Similarly, H. C. Leavitt 
states that “a number of psychopathological syndromes are not effec- 
tively detected by the tests’’ (26, p. 356). 

A practical fact in connection with false negatives is this: the vet- 
erans’ hospitals are now caring for large numbers of neuropsychiatric 
patients, who obviously were not screened and culled successfully by 
personality inventories or psychiatric interviews. One reason for this 
may be that the armed services failed to apply the screening technique 
regularly, or to abide by the results consistently. Another reason may be 
that it was considered advisable to reduce the number of false positives, 
even at the cost of increasing the number of false negatives; if so, this of 
course represents an administrative adjustment to the uncertainties of 
the diagnostic technique; and the resulting increase in false negatives 
(some of whom doubtless ended in veterans’ hospitals) is chargeable to 
the inadequacies of the diagnostic technique.¹¹ 


It sometimes appears that writers gave undue emphasis to the detected 
psychiatric cases (the true positives) rather than to the undetected ones (the 
false negatives). After all, the detected cases are probably those who are the 
least difficult to catch—as evidenced by the fact that they were detected. 
(As in crime: the criminals caught are those who are easiest to catch.) 


A critic has pointed out to the writers that civilian, as well as mili- 
tary, validity studies tend to pay relatively little attention to the false 
negatives—and we agree. However, the false negatives seem to us more 
important in the military situation than in the civilian. In the first 
place, as already mentioned, the life and success of a combat unit may 
well hang on the teamwork and dependability of the individuals in the 
unit: the front line is a poor place for the malfunctioning or breakdown 
of a false-negative case. In the second place, the military samples to 
whom personality inventories are commonly applied contain a consid- 
erably larger proportion of normal people than the civilian samples to 
whom personality inventories are ordinarily applied in practice. In a 
sample with a preponderance of normals, the major error (so far as num- 
bers of cases are concerned) would be to mis-diagnose or mis-prognose 
the normals as abnormal. Hence, in such a sample, it is important to 
reduce the proportion of false positives: and this can be accomplished 
by setting the cutting score high. But the effect of a high cutting score 
is to increase the incidence of false negatives. Until the inventories 
become more efficient, this is the necessary cost at which an acceptably 
low rate of false positives can be achieved. If there is any merit in either 
of the arguments in this paragraph, it appears that the neglect of false 
negatives in the military studies leads to a more optimistic conclusion 
than would a similar neglect in studies of typical civilian clinical groups. 
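
The trade-off described in this paragraph can be illustrated numerically. The sketch below is not drawn from any of the studies reviewed: the sample of 1000, the 5 per cent proportion of abnormals, and the two normal score distributions are all assumed purely for illustration. It computes the expected numbers of true positives, false positives, and false negatives at a low and at a high cutting score.

```python
from statistics import NormalDist

def screening_counts(n, base_rate, cutoff,
                     normal=NormalDist(10, 4), abnormal=NormalDist(18, 4)):
    """Expected counts when every score at or above `cutoff` is flagged.

    The score distributions for normals and abnormals are hypothetical.
    """
    n_abnormal = n * base_rate
    n_normal = n - n_abnormal
    # Normals scoring above the cutoff are false positives.
    false_pos = n_normal * (1 - normal.cdf(cutoff))
    # Abnormals scoring above the cutoff are true positives; the rest are missed.
    true_pos = n_abnormal * (1 - abnormal.cdf(cutoff))
    false_neg = n_abnormal - true_pos
    return {"true_pos": true_pos, "false_pos": false_pos, "false_neg": false_neg}

low = screening_counts(1000, 0.05, cutoff=14)
high = screening_counts(1000, 0.05, cutoff=20)
for label, counts in (("cutting score 14", low), ("cutting score 20", high)):
    print(label, {k: round(v, 1) for k, v in counts.items()})
```

Under these assumptions the low cutting score flags more normals than abnormals, while the high cutting score removes most of the false positives only by letting the majority of the abnormals pass undetected.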

11. Specialized design, validation, and standardization. Probably one 


1 Dr. Glenn V. Ramsey of Princeton University is inclined to take issue with this 
statement, on the ground that it is equivalent to demanding that an inventory predict 
all future neuropsychiatric breakdowns. In his own terms: “In the first place it is de- 
manding that the inventory perform a task which is impossible even with the use of all 
available scientific knowledge and skills which are adaptable to a military setting. It 
would be more profitable and practical to attack this diagnostic problem by constructing 
inventories designed for specified purposes, such as the general recruit screening, adjust- 
ment to a specific program or activity, etc. Secondly, it is necessary to consider the pos- 
sibility of specificity of vulnerability, in addition to the basic or general stability factor. 
The individual may or may not encounter during his military career specific situations or 
conditions which would precipitate a neuropsychiatric breakdown as a consequence of 
specific vulnerability. Demanding that an inventory predict breakdowns from both 
general and specific factors along with those resulting from psychosis, brain pathologies, 
epilepsy, etc., is asking for a degree of efficiency that is beyond reasonable expectations, 
in view of our present knowledge concerning these matters.” 
of the main reasons for the relative success of personality inventories in 
military practice is the fact that in many instances the instrument was 
specifically designed for the group to whom it was ultimately applied, 
and specifically validated and standardized in this same group. Further, 
the tests which were most frequently employed in military practice— 
such as the Personal Inventory and the Cornell Selectee Index—were 
not only originally designed for military men and their problems; but 
also, when these instruments were applied to specialized military groups 
(such as Air Force candidates or submarine trainees) they were fre- 
quently modified for the particular groups to whom they were applied. 


For example, Shipley and his associates (48, 49, 52), when validating the 
Personal Inventory, consistently determined item validities which applied spe- 
cifically to each new kind of group. Almost all the other military experimenters 
who employed the Personal Inventory also tended to follow this custom- 
tailored validation procedure. Wolff and his associates, reporting on the 
standardization of the Cornell Selectee Index, noted that “it is worthy of 
emphasis that each item in the Index was incorporated only after having been 
exposed to an exhaustive item analysis and statistical validation” (75, p. 3). 
Here again this type of specialized validation procedure paid good dividends. 
Altus and Bell (2), in working with illiterate Army Special Training Center 
candidates, carried out item validations on the Bell, Minnesota Multiphasic, 
and Army Adjustment schedules before they selected the particular questions 
which applied to the group with whom they were working; and then they de- 
vised an oral administration of their composite inventory, since the ordinary 
printed form would obviously be useless with illiterates. 


The military experimenters, moreover, seem always to have employed 
some specific criterion group in their standardization procedures. They 
did not follow the too common civilian practice of “validating” their 
questions exclusively in terms of the criterion of internal consistency. 
If they wished to measure neurotic or psychotic tendencies with their 
instruments, they invariably employed neurotic or psychotic criterion 
groups in their standardization procedures. This kind of specialized 
external validation almost surely accounted for much of the success of 
the inventories employed by the armed forces. 


In defense of civilian practice, it must be mentioned that the specialized 
construction, validation, and standardization of personality inventories re- 
quires large funds and large, appropriate samples—conditions which the civilian 
experimenter can rarely hope to achieve. 


12. Realistic application. While civilian psychologists and educators 
sometimes apply personality inventories to diagnostic problems for 
which they were not originally intended, the military users of these in- 
struments were usually much more realistic, and demanded of the tests 
only the limited applications for which the tests are suited. As Zubin 
has aptly stated: 


Perhaps the most important factor was the lower level of aspiration which 
these inventories adopted. They were, from their very beginning in World War 
I, not regarded as personality tests; they were merely sieves separating the 
recruits into two groups—those who had to be screened further by a clinician in 
a short personal interview, and those who needed no further screening (76, p. 
58. See also reference 24). 


This, apparently, is what personality inventories can do; and this, and 
little more, is what their military users commonly asked of them. 


MILITARY VALIDATION MAKING USE OF PERFORMANCE-MEASURES 


The second method of estimating the value of personality inventories 
in military practice made use of performance-measures. Thus, inven- 
tories were commonly administered to the members of a training course, 
and later the relation was determined between inventory scores on the 
one hand, and success or failure in the course, on the other. The relevant 
data from military studies using some kind of performance-measure are 
listed in Table II. 

Of the studies summarized in Table II, it may be observed that only 
seldom did the personality inventories prove distinctly effective in pre- 
dicting or discriminating successful from unsuccessful performance. 
This negative finding suggests that the role of intelligence in favoring 
both successful performance and “adjusted” responses to the inventory-items 
is not very strong. The negative finding is, of course, quite the opposite 
of the generally favorable results when the inventories were 
validated against psychiatric criteria. Some reasons for this contrast are 
suggested below. 

1. Prior elimination of abnormals. In many of the military studies, 
both the “pass” and the “fail” groups had undergone some prior selection 
with respect to minimum emotional fitness. Hence the lower end 


TABLE II 

MILITARY VALIDITY STUDIES OF PERSONALITY INVENTORIES, 
MAKING USE OF PERFORMANCE-MEASURES 

Personal Inventory 

Source: Cerf (8) 
Group tested: 1419 AAF pilots in primary training 
Performance-measure: Pass vs. fail 
Result: A biserial validity coefficient of .06 was found. “The obtained coefficient is barely significant at the 5% level, but it is in the unexpected direction.... The Shipley Personal Inventory, format B, is not useful as an instrument for the prediction of graduation-elimination from primary pilot training.” 

Personal Inventory (continued) 

Source: Graham, Mote, & Berry (18) 
Group tested: 2608 submarine school trainees 
Performance-measure: Pass vs. fail on tank escape performance 
Result: A tetrachoric validity coefficient of .14 was found; and a non-significant Chi-square of .58 for 1 d.f. was found between pass-fail criterion and PI scores. 

Source: Leavitt, Williams, & Lipkin (29) 
Group tested: 670 Marine officer candidates 
Performance-measure: Pass vs. fail 
Result: The obtained biserial validity coefficient of .28 “is not sufficiently high to warrant use of the MFRL Stencil in individual prediction, but it is high enough for limited selection of large groups.” 

Source: Lepley (30) 
Group tested: 121 AAF radar trainees 
Performance-measure: Pass vs. fail 
Result: Non-significant validity coefficients ranged from -.15 to .06 on various subtests. 

Source: Lepley (30) 
Group tested: 297 AAF lead bombardiers 
Performance-measure: Strike-photo analysis: radial errors in 100’s of feet 
Result: Non-significant biserial validity coefficients ranged from -.07 to .10 on various subtests. 

Source: Lepley (30) 
Group tested: 303 AAF lead bombardiers 
Performance-measure: Strike-photo analysis: per cent hits within 1000 feet 
Result: Non-significant biserial validity coefficients ranged from -.12 to .07 on the subtests. 

Source: Lepley (30) 
Group tested: 267 AAF lead navigators 
Performance-measure: Strike-photo analysis: radial errors in 100’s of feet 
Result: Non-significant biserial validity coefficients ranged from -.08 to .07 on the subtests. 

Source: Lepley (30) 
Group tested: 265 AAF lead navigators 
Performance-measure: Strike-photo analysis: per cent hits within 1000 feet 
Result: Non-significant biserial validity coefficients ranged from -.09 to .11 on the subtests. 

Source: Satter (45) 
Group tested: 1400 enlisted Naval personnel 
Performance-measure: Ratings of the men by their submarine officers 
Result: Correlations between Personal Inventory scores and the criterion ratings were low, and in no case significantly different from zero. 

Source: Satter (44) 
Group tested: 1079 enlisted parachute trainees 
Performance-measure: Pass vs. fail 
Result: A validity correlation of .39 was obtained between the pass-fail criterion and the inventory scores. Critical ratios (t’s) of the mean score group-differences were always 4.1 or greater. 

Personal Inventory (continued) 

Source: Shipley, Gray, & Newbert (51) 
Group tested: 1466 Naval trainees 
Performance-measure: Pass vs. fail 
Result: “The Personal Inventory identified a significant proportion of the 52 men who were later discharged. 21% of these had received scores of 18 or above on the PI, as compared with but 5% of the active men. The mean score for the discharges was significantly higher—4.3 points (CR=3.9)—than that for the active group.” 

Source: Williams & Leavitt (68) 
Group tested: 1039 officer candidates 
Performance-measure: Pass vs. fail 
Result: “The Personal Inventory Form 4 successfully identified a significant proportion of the men who later failed in OCS.... The extent of this relation is indicated by a correlation of .48 between PI scores and OCS success and failure (computed biserially).” 

Source: Williams, Leavitt, & Blair (71); also Leavitt & Adler (27) 
Group tested: 185 Marine officer candidates 
Performance-measure: Ratings by superior officers on combat proficiency 
Result: “There is very little, if any, relationship between the inventory scores and the combat rating.... At no point on the PI scale could a cutting score have been selected which would profitably eliminate men who would later prove to be poor in combat.... The tetrachoric correlation turns out to be .15, an insignificant figure.” 

Source: Williams, Leavitt, & Mendola (73) 
Group tested: 757 Marine Corps officer candidates 
Performance-measure: Pass vs. fail 
Result: The critical ratio of the mean score group-difference was 3.6; and the biserial validity coefficient was .18. 

Personal Inventory and Cornell Selectee Index 

Source: Stolurow, Irion, & Pascal (54) 
Group tested: 300 instructors at AAF Gunnery Instructors’ School 
Performance-measure: Upper 27% vs. lower 27% of the candidates 
Result: “Of the 60 items, 41 discriminated between the high and the low groups on either the Personal Inventory or the Cornell Selectee Index, with at least a discrimination index (Chi-square) which would have been expected to occur by chance less than 5 times in 100.” 

NRC Neurotic Inventory 

Source: Satter (45) 
Group tested: 1400 enlisted Naval personnel 
Performance-measure: Ratings of the men by their submarine officers 
Result: Correlations between inventory scores and the criterion ratings were low and non-significant. 

Humm-Wadsworth Inventory 

Source: Cerf (8) 
Group tested: 202 pilots in primary training 
Performance-measure: Pass vs. fail 
Result: Biserial validity coefficients on the 8 Scales ranged from -.21 to .13, only two of them being significant: those for the Hysteroid and the Epileptoid Scales. 

Source: Cerf (8) 
Group tested: 200 pilots in primary training 
Performance-measure: Pass vs. fail 
Result: Biserial validity coefficients on the 8 Scales ranged from -.22 to .16; only those for the Hysteroid and Epileptoid Scales were significant. 

Source: Cerf (8) 
Group tested: 200 AAF pilots in primary training 
Performance-measure: Pass vs. fail 
Result: Non-significant biserial validity coefficients ranging from -.01 to .16. 

Source: Cerf (8) 
Group tested: 193 AAF pilots in training 
Performance-measure: Pass vs. fail* 
Result: Non-significant Chi-squares of 1.05 and 4.82 for 4 degrees of freedom. “More than 90% of chance deviations would have been as great.” 

Source: Cerf (8) 
Group tested: 195 AAF pilots in training 
Performance-measure: Pass vs. fail† 
Result: Non-significant Chi-squares of .11 and .34 for 2 d.f. “More than 60% of chance deviations would have been as great.... No significant relationships were found between pilot success in primary flying school and ratings either of Dr. Humm’s analyses of temperament integration or of his case summaries.” 

* Case summaries made by Dr. Humm from the responses to the Inventory were compared with training-course success or failure. 
† Case summaries concerning temperamental integration, made by Dr. Humm from the responses to the Inventory, were compared with training-course success or failure. 

Information Blank 

Source: Cerf (8) 
Group tested: 275 AAF bombardiers, navigators, and pilots in advanced training 
Performance-measure: Pass vs. fail 
Result: Non-significant biserial validity coefficients of -.05, .07, and .05 were obtained between the test scores and the criterion. 

Source: Cerf (8) 
Group tested: 83 pilots in primary training 
Performance-measure: Pass vs. fail 
Result: Non-significant biserial validity coefficients ranging from .16 to -.21 were obtained. “It does not predict graduation or elimination from primary training.” 

Self Descriptive Inventory 

Source: Miles & others (34) 
Group tested: 3104 Marine trainees 
Performance-measure: Pass vs. fail 
Result: At a cutting score of 6, 80% positives were detected at the expense of 24% false positives and 20% false negatives. 

Confidential Questionnaire 

Source: Williams, Leavitt, & Blair (72); also Leavitt, Williams, & Blair (28) 
Group tested: 666 Marine officer candidates 
Performance-measure: Pass vs. fail 
Result: “The scores ... correlated with the Platoon Commanders School pass-fail criterion to the extent of .21 (biserial).... A small but substantial difference appears between the mean scores of the 395 successful candidates and the 271 unsuccessful candidates, which are 14.6 and 13.8 respectively.” 

Personal Preference Questionnaire 

Source: Williams, Leavitt, & Mendola (74); also Williams and Leavitt (69) 
Group tested: 649 Marine Corps officer candidates 
Performance-measure: Pass vs. fail 
Result: Validity coefficients of .31 and .25 were found between the criterion and the Modesty-Egoism and Social Judgment Scales. Critical ratios of the mean score group-differences were 6.5 and 5.1 respectively. 

Minnesota Personality Scale 

Source: Cerf (8) 
Group tested: 338 AAF pilots in primary training 
Performance-measure: Pass vs. fail 
Result: Non-significant biserial validity coefficients ranged from -.09 to .09. “It appears that the Minnesota Personality Scale, CE 438A, has no value for predicting success in primary pilot training.” 

Minnesota Multiphasic Inventory 

Source: Cerf (8) 
Group tested: 400 AAF pilots in primary training 
Performance-measure: Pass vs. fail 
Result: Item validities showed 72 significant phi’s out of 699 items. “In view of the apparently unimodal distribution of phi’s with a central tendency at zero, it is probable that there are few, if any, genuinely valid items in this collection for the prediction of primary pilot training success.” 

Source: Jensen & Rotter (22) 
Group tested: 1548 Army officer candidates and officers 
Performance-measure: Officer candidates vs. outstanding officers 
Result: Neither the Psychasthenia Scale of the MMPI nor the C-Inventory differentiated significantly between the groups of officer candidates and outstanding officers. 

Bell-MMPI-Army Adjustment Inventory 

Source: Altus (1) 
Group tested: 3614 Army Special Training Center candidates 
Performance-measure: Pass vs. fail 
Result: The inventory validly distinguished passing from failing ASTC candidates in certain instances. Tetrachoric correlation validity coefficients ran as high as .64. 

Source: Altus & Bell (2, 3) 
Group tested: 200 Army Special Training Center candidates 
Performance-measure: Pass vs. fail 
Result: The inventory significantly differentiated the passing from the failing candidates, the critical ratio of the mean score group-difference being 8.49. A biserial correlation of .45 between test scores and criterion was also obtained. 

Bernreuter Personality Inventory 

Source: Cerf (8) 
Group tested: 800 AAF pilots in primary training 
Performance-measure: Pass vs. fail 
Result: On item analyses, only 6 phi’s out of 108 reached or exceeded the 5% level of significance. “This instrument contains an insufficient number of valid items for the prediction of primary pilot success to make further scoring measures or statistical analysis worth while.” 

Personal Audit 

Source: Cerf (8) 
Group tested: 271 pilots in primary training 
Performance-measure: Pass vs. fail 
Result: “The biserial coefficients range from -.12 to .09, which are well within the range to be expected of a chance distribution of biserial correlations, the true mean of which is zero.... This test is of no value in predicting pilot performance in primary training.” 

Inventory of Factors GAMIN 

Source: Cerf (8) 
Group tested: 782 AAF pilots in primary training 
Performance-measure: Pass vs. fail 
Result: Non-significant biserial validity coefficients ranged from -.05 to .04. Item validities showed 45 significant phi’s out of 424 items. “The Inventory of Factors holds practically no promise as an instrument for predicting graduation-elimination from primary pilot training.” 

Inventory of Factors STDCR 

Source: Cerf (8) 
Group tested: 1106 AAF pilots in primary training 
Performance-measure: Pass vs. fail 
Result: Biserial validity coefficients ranged from .03 to -.09; 42 out of 304 items were found to have significant phi’s. “It appears that the Inventory of Factors STDCR is not promising for predicting graduation-elimination from primary pilot training.” 

Guilford-Martin Personnel Inventory 

Source: Cerf (8); also Guilford (19) 
Group tested: 945 AAF pilots in primary training 
Performance-measure: Pass vs. fail 
Result: Significant biserial validity coefficients from .10 to .14 were found. Item validities showed 50 significant phi’s out of 275 items. “It appears that the Guilford-Martin Personnel Inventory has some promise for predicting graduation-elimination from primary pilot training.” 

of the distribution of emotional adjustment was more or less eliminated; 
and this would tend to reduce or conceal such relation between inven- 
tory-scores and performance-measures as might actually exist in unse- 
lected samples. This would be especially true if the relation between 
emotional adjustment and performance-measures is curvilinear (i.e., 
stronger at the lower end than in the middle and upper sections). 


Thus, in Graham, Mote, and Berry’s (18) use of the Personal Inventory with 
submarine school recruits, it was noted that about eight per cent of the candi- 
dates had, for psychiatric reasons, been refused admission to the school; so that 
the group which remained to pass or fail in training was presumably “normal.” 
Since both the passing and the failing candidates had already been neuropsychi- 
atrically screened, the absence of a significant difference in mean inventory- 
scores between the two groups may merely testify to the efficiency of the previ- 
ous screening. 

Similarly, Jensen and Rotter (22) used the Minnesota Multiphasic Inven- 
tory to try to distinguish officer candidates from outstanding officers, and found 
no significant difference. But since officer candidates go through several rigor- 
ous screenings before they ever reach officer candidate schools, there is little 
reason to believe that the MMPI could distinguish between the two groups 
employed in this study. 

Again: Satter (45) found that the Personal Inventory did not significantly 
distinguish between submarine men who received high and low performance 
ratings by their officers. As Satter points out, “it is conceivable that, since the 
men were in part selected on the basis of Personal Inventory scores, the low 
validity coefficients are the result of a curtailed range of adjustment” (45, p. 10). 
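
The effect of prior screening on a correlation can be shown with a small simulation. This is a minimal sketch under assumed conditions, not a reanalysis of any study cited: the adjustment scores are random normal deviates, the screening threshold `CUT` is hypothetical, and the curvilinear relation (performance suffers only at the lower end of adjustment) is built in by construction.

```python
import random

def pearson(xs, ys):
    """Product-moment correlation, written out in full."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
CUT = -1.0              # hypothetical screening threshold (standard-score units)

adjustment, performance = [], []
for _ in range(20000):
    a = rng.gauss(0, 1)
    # Assumed curvilinear relation: performance falls off only below CUT;
    # above it, adjustment contributes nothing beyond random error.
    p = min(a, CUT) + rng.gauss(0, 0.3)
    adjustment.append(a)
    performance.append(p)

r_unselected = pearson(adjustment, performance)

# Screen out the lower end, as a prior psychiatric selection would.
kept = [(a, p) for a, p in zip(adjustment, performance) if a >= CUT]
r_screened = pearson([a for a, _ in kept], [p for _, p in kept])

print(f"unselected sample: r = {r_unselected:.2f}")
print(f"after screening:   r = {r_screened:.2f}")
```

In the unselected sample the correlation is clearly positive; once the lower end is removed, it falls to approximately zero, although the underlying relation has not changed.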


It may also be pointed out that if the passing and the failing group 
—or the superior and the less-superior group—are both under motiva- 
tion (conscious or unconscious) to obtain favorable inventory scores, 
then significant inventory-differences between the two groups are less 
likely to be found. Thus, as mentioned, Jensen and Rotter (22) found 
no significant difference in inventory scores of officer candidates vs. 
outstanding officers. Similarly, in a sample of submarine men (all vol- 
unteers, and ordinarily preferring to remain in the submarine branch), 
Satter (45) found no significant relation between inventory scores and 
performance-ratings. 

2. Unreliability and invalidity of performance-measures. Data are 
lacking by which the reliability and validity of military performance- 
measures (typically pass-fail in a training-course) could be compared 
with the reliability and validity of the psychiatric criterion; and in any 
event, it is too easy to blame the shortcomings of a test on the variable 
against which it is being correlated. Nevertheless, it may be worth call- 
ing attention to the fact that serious unreliability or invalidity of a per- 
formance-measure will tend to reduce or conceal whatever correlation 
may exist between the performance-measure and the inventory scores. 
In Williams, Leavitt, and Blair’s (71) study, for example, scores on the 
Personal Inventory were found to be negligibly related to ratings of Ma- 
rine officer candidates on combat proficiency. But such ratings (it seems 
fairly well agreed) are themselves of dubious reliability and validity; and 
it is conceivable that marked improvement of the ratings might raise the 
negligible relation to a statistically significant and practically useful 
level (15). 
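
The size of this effect can be gauged from the classical correction for attenuation, in which an observed correlation is divided by the square root of the product of the reliabilities of the two measures. The sketch below applies that formula; the observed correlation of .15 and the criterion reliabilities are hypothetical values chosen for illustration, and the inventory is assumed, for simplicity, to be perfectly reliable.

```python
import math

def disattenuate(r_observed, criterion_reliability, test_reliability=1.0):
    """Classical correction for attenuation.

    Returns the correlation expected if both measures were perfectly
    reliable, given the observed correlation and the two reliabilities.
    """
    if not (0 < criterion_reliability <= 1 and 0 < test_reliability <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_observed / math.sqrt(criterion_reliability * test_reliability)

# Hypothetical observed validity coefficient and criterion reliabilities.
for reliability in (0.9, 0.5, 0.3):
    corrected = disattenuate(0.15, reliability)
    print(f"criterion reliability {reliability}: corrected r = {corrected:.3f}")
```

The lower the reliability of the ratings, the larger the correction; but even a very unreliable criterion would not, on these assumed figures, turn a negligible coefficient into a high one.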

3. Residual relationship. Individual differences in performance- 
measures ordinarily depend more largely on differences in aptitude and 
previous training, than on differences in emotional adjustment.” At 
best, then, it is only what may be termed a residual portion of perform- 
ance that is dependent on the adequacy of emotional adjustment. In 
short, a high correlation between personality-inventory scores and per- 
formance-measures is not to be expected. By contrast, the psychiatric 
criterion provides a direct or alternative measure of what the military 
personality-inventories attempt to measure. 

4. Shift of original criterion. Finally, it should be remembered that 
the personality inventories were originally validated against a psychiat- 
ric criterion, and not against performance-measures. It is possible that 
item analysis, making use of performance measures as criteria, could 
raise the residual relationship between personality-inventory scores and 
performance-measures to a practically useful level. 


SUMMARY AND CONCLUSIONS 


Military applications of personality inventories have yielded enough 
favorable results to command attention. In contrast, personality in- 
ventories in civilian practice have generally proved disappointing. To 
throw some light on this contrast, the writers undertook to review the 
available papers on military validation of personality questionnaires. 
Table I presents a summary of studies making use of a psychiatric 
criterion (prognosis or diagnosis of neuropsychiatric unfitness for 
military duty); Table II presents a summary of studies making use of a 
performance-criterion (most commonly, success in training-courses). 
Detailed consideration of the various studies leads to the conclusion 
that both spurious and legitimate factors account for the superior 
showing of the personality inventories in military practice. With regard 
to studies making use of a psychiatric criterion, the following factors 
appear to have played a part in the results obtained: 

1. Criterion contamination (knowledge of inventory scores at time of making 
psychiatric prognosis or diagnosis). 

2. Criterion overlap (duplication of questions asked in the inventory and by 
the psychiatrist). 


” There are exceptions to this general statement, especially when the performance is 
of such a nature as to depend chiefly on persistence of attention, on conscientiousness of 
effort, or on courage and daring (44). But the statement appears to be true at least for 
the types of performance-measures usually employed in the military studies of personality 
inventories. 
3. Use of extreme or atypical groups. The use of extreme groups or of groups 
containing an atypically large proportion of psychiatrically “positive” cases 
leads to atypically favorable results (lower incidence of “false positives,” higher 
biserial or tetrachoric correlation between inventory and criterion). 

4. Differential motivation. In some studies making use of two contrasted 
groups, the motivation of the “abnormal” group toward self-incriminating 
responses, and/or of the normal group toward self-inflationary responses, 
probably led to exaggerated differences between the groups and a spuriously 
high index of validity for the personality inventory. 

5. Statistical inadequacies. In some of the studies, questionable or inflated 
indexes of validity were obtained through: calculation of biserial correlations 
for samples containing an extraordinarily high proportion of abnormals; use of 
samples containing very small numbers of cases in the diagnostic categories; 
reliance upon the critical ratio or t, without considering the degree of relation 
connoted by the given C.R. or t; and use of the original standardization sample 
(instead of a fresh sample) for validation purposes. 

6. Lenient evaluation of false-positive results. Except in samples containing 
an unusually large proportion of psychiatrically “positive” cases, the number 
of cases falsely classified as positive by the inventory generally exceeds, by a 
great deal, the number correctly classified as positive. Each false-positive case 
may, however, be viewed as a misclassified normal case; the percentage of mis- 
classification (percentage of false-positive to total normal) is generally small. In 
most of the studies, the percentage-interpretation prevails, rather than the 
more practical interpretation in terms of number of cases. 

7. Neglect of “false-negative” cases. In most validity studies (both military 
and civilian), the number and percentage of false-negative cases are not re- 
corded. Since false negatives are a more serious liability in the military situation 
than in the civilian, the neglect of false negatives may lead to undue optimism 
in evaluating the military studies. 

8. Sample heterogeneity. As compared with civilian examinees, the armed 
forces’ unscreened recruits frequently included a larger proportion of lower-end 
(and comparatively easily diagnosable) cases, such as unemployables, “bums,” 
alcoholics, frank neurotics, etc. 

9. Lower level of intelligence. It appears likely that the comparatively direct 
questions of personality inventories have greater validity when used with less 
intelligent or more naive individuals such as are found in military samples, 
rather than the selected samples commonly used in civilian studies. 

10. Honesty of response. It appears likely that members of the armed serv- 
ices answered the personality inventories with less distortion than civilians 
ordinarily do. Among possible reasons for this is the fact that the direct penal- 
ties for falsification are greater in the military situation than in the civilian. 

11. Specialized design, validation, and standardization. The inventories in 
most common use were specifically designed for military groups, and often were 
modified when applied to a group different from the standardization-group. 
Moreover, external criterion groups were used for validation purposes, instead 
of merely the criterion of “internal consistency.” 

12. Realistic application. In general, the inventories were applied for 
screening only, and not for elaborate personality analysis. 


While the military investigators should be credited with the favorable 
factors mentioned above, it would be unfair to charge them with 
ignorance or neglect of the other factors. The exigencies of military 
circumstances generally prevented the application of any completely 
ideal methodology, even when the experimenter was keenly aware of 
the shortcomings of his samples or his procedures. 

In contrast to their success in relation to the psychiatric criterion, 
the military personality inventories proved generally ineffective for 
predicting performance-measures (such as successful completion of a 
training-course). The reasons for this difference appear to be as follows: 


1. Prior elimination of abnormals. Most men selected for training-courses 
had already been screened or otherwise certified as to minimum necessary emo- 
tional fitness. 

2. Unreliability or invalidity of the performance-measures. This may be a 
contributing factor to the unfavorable results, at least in a few instances. 

3. Residual relationship. Individual differences in performance-measures 
ordinarily depend more largely on differences in aptitude and previous training, 
than on differences in emotional adjustment. The relationship between per- 
sonality-inventory scores and performance-measures, in consequence, is at best 
of only residual strength. 

4. Shift of original criterion. The original validation of the personality 
inventories was in terms of a psychiatric criterion, and not in terms of 
performance measures. 


In conclusion it appears that, while the experimental or statistical 
shortcomings of many of the military studies justify a cautious attitude 
toward the results obtained, the fact cannot be ignored that the in- 
ventories usually did make some definite contribution to psychiatric 
screening. The success of the inventories in the military situation en- 
courages the hope that similar inventories may prove equally useful in 
civilian practice. Military experience suggests emphasis on the follow- 
ing points: 


1. Personality questionnaires should be especially designed for the group to 
whom they are applied, and should be validated against dependable external 
criteria. Criterion-contamination should be guarded against; and criterion- 
overlap, if it occurs, should be taken into account in evaluating the findings. 

2. Special attention should be given to persuading or inducing respondents 
to answer the inventory-items as truthfully as they can. 

3. Personality inventories may possibly be more effective when used with 
relatively uneducated and less intelligent groups, than with groups that are 
more sophisticated. 

4. The users of personality inventories should realize that only limited and 
specialized demands may be made on the inventory technique; and that broad 
and incisive personality diagnosis is still the specialty of the trained clinician 
employing subtler and more comprehensive psychological techniques.


THE LATIN SQUARE PRINCIPLE IN THE DESIGN AND
ANALYSIS OF PSYCHOLOGICAL EXPERIMENTS


DAVID A. GRANT¹
University of Wisconsin 


INTRODUCTION 


The latin square, as such, may be used infrequently in psychological 
investigation, but the basic principles of experimental design and analy- 
sis embodied in the latin square may be used to increase greatly the 
efficiency of many types of psychological research. This fact has al- 
ready been pointed out (6, 15), but a more extensive discussion seems 
desirable. It is the purpose of this paper to outline some of the relevant 
features of the latin square and to illustrate how these features apply 
in several kinds of psychological experiments. It will be assumed that the 
reader has already acquired a familiarity with the rudiments of analysis 
of variance (3, 11, 14). 

A latin square is an arrangement of latin letters in rows and columns 
such that each letter appears once and only once in each row and each 
column (4, 7). As an example, a five-by-five latin square is given below. 


                    C   E   B   D   A
                    E   A   C   B   D
                    B   D   E   A   C
                    A   B   D   C   E
                    D   C   A   E   B


Such latin squares are typically used in agricultural field experiments 
to control soil variability. Thus A, B, C, D and E typically represent 
the independent variable, say of five different fertilizer treatments used 
on 25 small plots of ground arranged as in the above square. Any nat- 
ural soil gradients parallel to the rows or the columns of the square will 
introduce irrelevant variability into the plot yields, but variation from 
such gradients can be eliminated statistically so that the precision of 
the comparison between treatments will be increased. The analysis of 
variance for this latin square will be presented later. 
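
The defining property is easy to verify mechanically. The Python sketch below is an illustrative addition rather than part of the original analysis; the function name `is_latin_square` and the encoding of each row as a string are assumptions made for the example. It checks the five-by-five square given above, whose rows are typed in directly.

```python
def is_latin_square(square):
    """Return True if each symbol occurs exactly once in every row and column."""
    n = len(square)
    symbols = set(square[0])
    if len(symbols) != n:
        return False
    # Every row must have n entries and use each symbol once.
    if not all(len(row) == n and set(row) == symbols for row in square):
        return False
    # Every column must use each symbol once.
    return all({row[j] for row in square} == symbols for j in range(n))


# The five-by-five square of the text, one string per row.
SQUARE_5 = ["CEBDA", "EACBD", "BDEAC", "ABDCE", "DCAEB"]

print(is_latin_square(SQUARE_5))  # True
```

The same check fails as soon as a letter is repeated in any row or column, which is a convenient safeguard when a square is copied by hand into a data sheet.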

Some of the important features of the latin square stand out when
this design is contrasted with a three-factor experiment in which each
level of each factor appears with each level of every other factor (4, 7).
If rows, columns, and treatments are considered as independent variables
or "main effects," both designs contain three independent variables
which are manipulated by the experimenter. The three independent
variables are, moreover, orthogonal to or statistically independent of
each other in both designs. The crucial difference between the two
designs appears in the evaluation of the interactions. In the factorial
design, all three first-order interaction effects and the second-order
interaction effect can be separated and evaluated.² The interactions are
all orthogonal to the main effects and to each other. This is not the
case with the latin square. In the latin square the interactions, if
present,³ are confounded (or mixed in) both with the effects of the
single independent variables and also with each other. The interactions
ordinarily cannot be evaluated in a latin square design. Although the
confounding effects balance out in complete sets of squares, for any
given square the confounding may be serious enough to counteract or
to enhance a main effect. Although this is relatively well-controlled in
agricultural applications, in psychological applications the experimenter
must at least be aware of the consequences of interactions confounded
with main effects. The exact nature of this confounding will be
demonstrated in connection with the 2 × 2 latin square.


1 This paper was completed during a research leave supported by the Graduate
Research Committee of the University of Wisconsin from special funds provided
by the State Legislature for 1947-48.


THE 2 × 2 LATIN SQUARE


The 2 × 2 latin square gives a striking illustration of the confounding
principle just mentioned. There are two 2 × 2 latin squares, one of
which is shown below:


                          I            II
                 1     $A_{11}$     $B_{12}$
                 2     $B_{21}$     $A_{22}$


The subscripts indicate the row and column position of each entry in the
square. Analysis of this square is very instructive. Suppose that A
and B represent two fertilizer treatments. Two parallel analyses of this
square are given in Table I, considering it first as a factorial design and
then as a latin square design. The correction term is

$$C = \tfrac{1}{4}(A_{11} + B_{12} + B_{21} + A_{22})^2.$$


2 Assuming an error estimate from replication.
3 In agricultural work such interaction might arise from a fertility gradient along a
diagonal of the square.


TABLE I

OUTLINED ANALYSIS OF 2 × 2 LATIN SQUARE CONSIDERED (1) AS A
FACTORIAL DESIGN AND (2) AS A LATIN SQUARE DESIGN

                        (1) Factorial Design                                        (2) Latin Square Design
Source of Variation     Sum of Squares                                              Sum of Squares

Rows                    $\tfrac12(A_{11}+B_{12})^2+\tfrac12(B_{21}+A_{22})^2-C$     $\tfrac12(A_{11}+B_{12})^2+\tfrac12(B_{21}+A_{22})^2-C$
Columns                 $\tfrac12(A_{11}+B_{21})^2+\tfrac12(B_{12}+A_{22})^2-C$     $\tfrac12(A_{11}+B_{21})^2+\tfrac12(B_{12}+A_{22})^2-C$
Row × Column
  Interaction           $\tfrac12(A_{11}+A_{22})^2+\tfrac12(B_{12}+B_{21})^2-C$
Treatment                                                                           $\tfrac12(A_{11}+A_{22})^2+\tfrac12(B_{12}+B_{21})^2-C$
Total                   $A_{11}^2+B_{12}^2+B_{21}^2+A_{22}^2-C$                     $A_{11}^2+B_{12}^2+B_{21}^2+A_{22}^2-C$


Examination of Table I reveals that the row and column sums of
squares are identical in the two analyses, but that the term known as
"interaction" in the factorial design is the same as that known as the
"treatment effect" in the latin square. This is always the case with the
2 × 2 latin square; the treatment effect is identical with the
row × column interaction. This identity represents total confounding of
these two sources of variation. It is easy to see, furthermore, that the
interaction of any pair of independent variables is completely confounded
with the third independent variable. Obviously one could not determine
the row × column interaction effect because of the complete confounding
of that with the treatment effect. Probably more important, one could not
determine treatment effects because of the possible presence of
row × column interaction.
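
The identity between the treatment and interaction sums of squares can be confirmed numerically. The following Python sketch is an illustrative addition; the function name `ss_2x2` and the four plot yields are hypothetical. It computes the interaction as the remainder after rows and columns (the factorial view) and the treatment sum of squares from the diagonal totals (the latin square view).

```python
def ss_2x2(a11, b12, b21, a22):
    """Sums of squares for the 2 x 2 square with treatment A on the main diagonal."""
    c = (a11 + b12 + b21 + a22) ** 2 / 4           # correction term
    rows = ((a11 + b12) ** 2 + (b21 + a22) ** 2) / 2 - c
    cols = ((a11 + b21) ** 2 + (b12 + a22) ** 2) / 2 - c
    total = a11 ** 2 + b12 ** 2 + b21 ** 2 + a22 ** 2 - c
    # Factorial view: what remains after rows and columns is interaction.
    interaction = total - rows - cols
    # Latin square view: treatment totals are A = a11 + a22 and B = b12 + b21.
    treatment = ((a11 + a22) ** 2 + (b12 + b21) ** 2) / 2 - c
    return {"rows": rows, "columns": cols, "interaction": interaction,
            "treatment": treatment, "total": total}


result = ss_2x2(12.0, 7.0, 9.0, 15.0)   # hypothetical plot yields
print(result["interaction"], result["treatment"])  # 30.25 30.25
```

Whatever four values are supplied, the two quantities agree, which is the total confounding described above.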

In the larger squares confounding occurs, but it is not complete. The
agricultural experimenter is not greatly inconvenienced by this type of
confounding, because he takes pains to use larger randomly selected
squares and avoids the use of systematic squares (4, sec. 34) which
emphasize this effect.⁴ Moreover he typically uses both rows and
columns to effect a double elimination of a single irrelevant variable,
soil heterogeneity. The psychological experimenter who uses latin
squares to control two different irrelevant variables may find, however,
that confounding can cause him serious difficulties. This is particularly
true if he uses the 2 × 2 latin square.

4 The selection of squares at random reduces the expected value of confounding to
zero. In the case of an N × N latin square, the expected value of the treatment mean square
equals the error variance plus N times the treatment variance. The technique of selecting
a square at random is described by Fisher and Yates (5, p. 13).

The 2 × 2 latin square arises rather frequently in experimental
psychology. Often each S is run through an experimental (E) and a
control (C) procedure. The careful investigator will usually run
half of his Ss in the sequence E-C, and half in the sequence C-E, and
the 2 × 2 latin square⁵ results. If the sequence (rows) in which the two
procedures are run interacts with ordinal position effects (columns),
involving perhaps habituation, practice, fatigue, etc., this interaction
will be completely confounded with the difference between the control
and experimental procedures. This follows from the fact noted above,
that in the two by two latin square the interaction of any two of the
three factors, rows, columns, and letters, is completely confounded with
the third factor. It should be remarked that the confounding is a
property of the design and will be present no matter whether the data are
completely analyzed by means of analysis of variance or are
incompletely analyzed by means of critical ratios or t-tests.
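
A small numerical illustration may make the point concrete. In the Python sketch below, which is an addition for illustration only, error-free cell values are built from hypothetical sequence, position, interaction, and treatment effects; all names and numbers are assumptions. The difference between the E and C means then equals the true treatment effect plus twice the interaction term, so the interaction can either manufacture a difference or cancel a real one.

```python
def crossover_cells(treatment_effect, interaction, base=50.0,
                    sequence_effect=(0.0, 3.0), position_effect=(0.0, -2.0)):
    """Error-free cell values; row 0 is sequence E-C, row 1 is sequence C-E."""
    cells = {}
    for r in (0, 1):
        for c in (0, 1):
            on_diagonal = (r == c)            # E occupies the main diagonal
            sign = 1.0 if on_diagonal else -1.0
            value = (base + sequence_effect[r] + position_effect[c]
                     + sign * interaction)    # sequence x position interaction
            if on_diagonal:
                value += treatment_effect
            cells[(r, c)] = value
    return cells


def apparent_difference(cells):
    """Mean of the two E cells minus mean of the two C cells."""
    e_mean = (cells[(0, 0)] + cells[(1, 1)]) / 2
    c_mean = (cells[(0, 1)] + cells[(1, 0)]) / 2
    return e_mean - c_mean


# No true effect, yet the interaction produces an apparent difference.
print(apparent_difference(crossover_cells(0.0, 2.0)))   # 4.0
# A true effect of 4 is cancelled by an interaction of opposite sign.
print(apparent_difference(crossover_cells(4.0, -2.0)))  # 0.0
```

The sequence and position effects drop out of the comparison, as the balanced design intends; only the interaction remains entangled with the treatment.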

A recent study on the T.A.T. (9) serves as an example in which such 
effects may be present. The intention was to discover whether there 
were differences in the themas evoked by the cards numbered 1-10 as 
compared with the themas evoked by the cards numbered 11-20. 

Thus, the differential stimulus value of the cards numbered 1-10 
and cards 11-20 was the independent variable of the experiment. Half 
of the Ss were given the cards in the normal sequence 1-10 then 11-20, 
and half were given the cards in reverse sequence 11-20 then 1-10
because the actual order in which the cards were presented was irrelevant
to the experiment. This can be presented diagrammatically as follows:


                                 Session
                               I         II
        Sequence   Normal     1-10      11-20
                   Reverse    11-20     1-10


The first session is indicated by I, and the second session by II.
Briefly stated, the results showed that more "personal," hostile, and
insecure themas appeared in the second ten, and this difference held
especially during the second session.

The presence of row × column or session × sequence interaction in
this experiment is not known, but there may be grounds for suspecting
its existence. For example, Ss might have been generally cautious and
unsure of themselves in the first session, which could have depressed the
frequency of personal themas in both sets of cards. In the second
session, Ss, in general, might have been more relaxed and more willing
to deliver anti-social themas, especially if they were stimulated with the
complex cards 11-20. If the cards were presented in the reversed order,
however, the more commonplace cards of the first ten might appear dull
and conventional after the S had faced the bizarre second ten, and, in
consequence, he might have tended to give more conventional
stereotyped themas. If any such process takes place, it may constitute a
row-column interaction effect which could overemphasize any differences in
the stimulus value of the first and second ten cards of the T.A.T.

5 There is only one 2 × 2 latin square in this context.

The type of interaction described above can have most serious con- 
sequences. Opportunities for such interactions to arise are frequent in 
psychological experimentation. Whether serious interactions are com- 
monly obtained is hard to say. The experimenter who uses two pro- 
cedures in the two possible presentation sequences must, however, be 
alert to consider such an eventuality in interpreting his experimental 
data. 


INDIVIDUAL DIFFERENCES AND THE LATIN SQUARE 


A very useful application of the latin square to psychological research 
arises in connection with the simultaneous control of individual differ- 
ences and temporal order of presentation of procedures. Just as the 
agricultural researcher faced with the constant problems of soil hetero- 
geneity has turned to the latin square, so the psychologist faced with 
his perpetual problem of individual differences, can also control a diffi- 
cult extraneous variable in his experiments by use of the same device. 
Actually, as Garrett and Zubin (6, p. 242) have pointed out, balanced 
orders of presentation, permuted double-fatigue orders, the classical 
ABBA order, etc. have long been used by careful experimenters, and 
these all embody certain latin square principles. Furthermore, a number 
of excellent papers (summarized in 12) have shown how repeated scores 
and matched groups offer opportunities for reducing the size of error 
estimates to increase the efficiency of experiments. This paper presents 
an extension of these principles well-known to the researcher. 

First, as an artificially simple example, using the 5X5 latin square 
shown earlier, suppose that the dependent variable is the amount of 
aggressive behavior observed in nursery school children in five con- 
trolled social situations. The five social situations constitute the in- 
dependent variable. Because the children may be expected to show con- 
sistent individual differences in aggressiveness, a factor irrelevant to the 
experiment, it is desirable that each child serve as his own control or go 
through each of the five different situations. Furthermore, let us
suppose that there may be some practice or adaptation effect in the
situations. This would rule out the procedure of going through the five
situations in one constant sequence or order. Let the social situations be
designated A, B, C, D, and E. If five children are available, each child
could be assigned to a single row of the latin square above, and run
through the situations in the order indicated below:


                                Order
                       I    II    III    IV    V
           Child 1     C    E     B      D     A
                 2     E    A     C      B     D
                 3     B    D     E      A     C
                 4     A    B     D      C     E
                 5     D    C     A      E     B


His aggressive behavior score would then be entered into a table set up 
in exactly the same pattern. In the above arrangement, individual 
differences would then produce inter-row variation with four degrees of 
freedom, practice effect would produce inter-column variation with four 
degrees of freedom, and the experimental factor, social situation, latin 
letters, with four degrees of freedom would be orthogonal to each of 
these. The error variation would be estimated with twelve degrees of 
freedom. 

Computation of the analysis of variance is straightforward. Let
$\sum X^2$ be the sum of all squared aggression scores. The correction factor
is $C = \tfrac{1}{25}(\sum X)^2$. Then, letting the totals for Children have arabic
subscripts, the totals for Order have latin numeric subscripts, and the
totals for Situation have latin literal subscripts, the sums of squares
are as follows:

1. $SS_{\text{Total}} = \sum X^2 - C$.

2. $SS_{\text{Children}} = \tfrac{1}{5}(T_1^2 + T_2^2 + \cdots + T_5^2) - C$.

3. $SS_{\text{Order}} = \tfrac{1}{5}(T_I^2 + T_{II}^2 + \cdots + T_V^2) - C$.

4. $SS_{\text{Situations}} = \tfrac{1}{5}(T_A^2 + T_B^2 + \cdots + T_E^2) - C$.

5. $SS_{\text{Error}} = SS_{\text{Total}} - (SS_{\text{Children}} + SS_{\text{Order}} + SS_{\text{Situations}})$.
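
The five sums of squares above can be computed in a few lines. The Python sketch below is an illustrative addition: the function name `latin_square_anova` is an assumption, and the aggression scores are hypothetical values invented for the example. Rows stand for children, columns for ordinal positions, and letters for social situations.

```python
from collections import defaultdict

# Rows = children 1-5, columns = ordinal positions I-V, letters = situations.
SQUARE = ["CEBDA", "EACBD", "BDEAC", "ABDCE", "DCAEB"]

# Hypothetical aggression scores, laid out in the same pattern as SQUARE.
SCORES = [
    [14, 9, 12, 8, 15],
    [10, 16, 13, 11, 7],
    [12, 6, 9, 15, 11],
    [17, 12, 8, 13, 10],
    [6, 11, 14, 9, 10],
]


def latin_square_anova(square, scores):
    """Sums of squares and degrees of freedom for a single n x n latin square."""
    n = len(square)
    grand_total = sum(sum(row) for row in scores)
    c = grand_total ** 2 / (n * n)                       # correction factor
    ss_total = sum(x * x for row in scores for x in row) - c

    row_totals = [sum(row) for row in scores]
    col_totals = [sum(row[j] for row in scores) for j in range(n)]
    letter_totals = defaultdict(float)
    for i in range(n):
        for j in range(n):
            letter_totals[square[i][j]] += scores[i][j]

    ss = {
        "children": sum(t * t for t in row_totals) / n - c,
        "order": sum(t * t for t in col_totals) / n - c,
        "situations": sum(t * t for t in letter_totals.values()) / n - c,
    }
    ss["error"] = ss_total - sum(ss.values())            # remainder
    ss["total"] = ss_total
    df = {"children": n - 1, "order": n - 1, "situations": n - 1,
          "error": (n - 1) * (n - 2), "total": n * n - 1}
    return ss, df


ss, df = latin_square_anova(SQUARE, SCORES)
for source in ("children", "order", "situations", "error", "total"):
    print(f"{source:<11} SS = {ss[source]:8.2f}   df = {df[source]}")
```

Each mean square is the sum of squares divided by its degrees of freedom, and the situations mean square would be tested against the error mean square with 4 and 12 degrees of freedom.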


Row Xcolumn interaction or individual differences in practice effect, 
if present, cannot be assessed. It may serve to enlarge or to shrink the 
error estimate and the mean square for the experimental factor. The 
other interactions are likewise confounded so that if certain situations 
favor some children but not all, or if certain situations are reacted to 
differently early in the sequence from the way they are reacted to later 
in the sequence, these interactions cannot be tested. If present, they 
may influence the size of the error term in the analysis of variance. 


COMPLETE SETS OF SQUARES


The experiment with nursery school children was a direct application 
of the latin square. More commonly the latin square will not be used 
directly, but the latin square principle will prove useful. As an ex- 
ample, consider the following investigation (8). It was desired to study 
the process of learning to learn nonsense syllables. Four experimental
procedures were used. (They involved three variations of group motion
picture presentation versus traditional individual sessions with the 
memory drum.) After the experimental procedures all Ss were given 
the same three successive lists to learn as a test. The number of trials 
necessary to learn to a criterion of two successive perfect repetitions 
was the score. (After pre-training the distributions of these scores were 
not badly skewed.) The three test lists, A, B, and C might be expected 
to vary in difficulty, and moreover, the Ss would still be changing in 
learning skill while memorizing the three test lists. These two factors 
were irrelevant to the experiment and should be controlled in order to 
eliminate the variance they introduced from the error estimate. Atten- 
tion will first be focussed on one of the four experimental groups. To 
balance out list difficulty each of the three lists should be used in each 
of the three positions. A latin square was thus suggested with three Ss 
taking the three lists in three different sequences, but three Ss would 
scarcely be adequate for reliable measures in a rote learning experi- 
ment. To overcome this difficulty all six permutations of the three lists 
—in effect two latin squares—were used. It was therefore necessary to 
use a number of Ss which was a multiple of six; e.g., thirty. Thus the 
data for this group consisted of 90 scores in a 30 row by 3 column table, 
where each row represented a different subject, and the three columns 
represented the first, second and third lists in order of presentation. 

Let a score from the above-mentioned table be designated by $X_i$,
the sum of scores for Ss by $T_1, T_2, \ldots, T_{30}$, the sum of scores for the
three lists by $T_A$, $T_B$, and $T_C$, and the sum of scores for the three
ordinal positions in the presentation sequence as $T_I$, $T_{II}$, and $T_{III}$. The
sums of squares for analysis of variance were found as follows:


(1) $SS_{\text{total}} = \sum_{i=1}^{90} X_i^2 - C$, where $C = \tfrac{1}{90}\bigl(\sum_{i=1}^{90} X_i\bigr)^2$      89 df
(2) $SS_{\text{subjects}} = \tfrac{1}{3}\sum_{i=1}^{30} T_i^2 - C$      29 df⁶
(3) $SS_{\text{lists}} = \tfrac{1}{30}(T_A^2 + T_B^2 + T_C^2) - C$      2 df
(4) $SS_{\text{order}} = \tfrac{1}{30}(T_I^2 + T_{II}^2 + T_{III}^2) - C$      2 df
(5) $SS_{\text{error}} = SS_{\text{total}} - (SS_{\text{subjects}} + SS_{\text{lists}} + SS_{\text{order}})$      56 df


6 The $SS_{\text{subjects}}$ and its 29 df can be broken down into $SS_{\text{sequences}}$ with 5 df and
$SS_{\text{individuals within sequence}}$ with 24 df, using the procedure described by Kogan (10).

In this case the sums of squares for the three "main effects," subjects,
lists, and order would not be inflated by confounded interactions
even if appreciable interactions were present. The sum of squares for
error did not contain confounded interaction but was restricted to "pure
error." This fortunate state of affairs arises because all possible
combinations were used an equal number of times. Interactions, however,
could not be assessed, except perhaps with considerable loss of data
because of partial confounding.
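
The design and its analysis for one experimental group can be sketched as follows. This Python example is an illustrative addition: the function names, the five Ss assigned to each of the six list orders, and the simulated scores are all assumptions, and the scores do not come from the study cited.

```python
import itertools
import random

LISTS = "ABC"


def build_design(n_per_sequence=5):
    """One row per S: the order in which that S learns lists A, B, and C."""
    sequences = list(itertools.permutations(LISTS))    # all six orders
    return [seq for seq in sequences for _ in range(n_per_sequence)]


def simulate_scores(design, seed=0):
    """Hypothetical trials-to-criterion scores with list and practice effects."""
    rng = random.Random(seed)
    difficulty = {"A": 0.0, "B": 2.0, "C": 4.0}
    practice = [3.0, 1.0, 0.0]
    scores = []
    for seq in design:
        ability = rng.gauss(0.0, 3.0)
        scores.append([20.0 + ability + difficulty[name] + practice[pos]
                       + rng.gauss(0.0, 1.0) for pos, name in enumerate(seq)])
    return scores


def analyse(design, scores):
    """Sums of squares (1) to (5) and their df for a subjects x positions table."""
    n_subjects, n_positions = len(design), len(design[0])
    n_scores = n_subjects * n_positions
    c = sum(map(sum, scores)) ** 2 / n_scores            # correction factor
    list_totals = {name: 0.0 for name in LISTS}
    order_totals = [0.0] * n_positions
    for seq, row in zip(design, scores):
        for pos, (name, x) in enumerate(zip(seq, row)):
            list_totals[name] += x
            order_totals[pos] += x
    ss = {
        "total": sum(x * x for row in scores for x in row) - c,
        "subjects": sum(sum(row) ** 2 for row in scores) / n_positions - c,
        "lists": sum(t * t for t in list_totals.values()) / n_subjects - c,
        "order": sum(t * t for t in order_totals) / n_subjects - c,
    }
    ss["error"] = ss["total"] - ss["subjects"] - ss["lists"] - ss["order"]
    df = {"total": n_scores - 1, "subjects": n_subjects - 1,
          "lists": len(LISTS) - 1, "order": n_positions - 1}
    df["error"] = df["total"] - df["subjects"] - df["lists"] - df["order"]
    return ss, df


design = build_design()
ss, df = analyse(design, simulate_scores(design))
for source in ("subjects", "lists", "order", "error", "total"):
    print(f"{source:<9} SS = {ss[source]:9.2f}   df = {df[source]}")
```

Because every list occupies every position equally often (ten times with thirty Ss), the list and order totals are balanced against each other and against the subject totals.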

The final analysis in the experiment was essentially a "between
groups"-"within groups" analysis (11, 14). Each of the other three
experimental groups was treated in essentially the same way, so that
four parallel analyses were made. The four group totals were $T_\alpha$, $T_\beta$,
$T_\gamma$, and $T_\delta$, and the differences between groups were assessed in terms of:


(6) $SS_{\text{between groups}} = \tfrac{1}{90}(T_\alpha^2 + T_\beta^2 + T_\gamma^2 + T_\delta^2) - \tfrac{1}{360}(T_\alpha + T_\beta + T_\gamma + T_\delta)^2$


with 3 df.

The mean square between groups was divided by the pooled error mean
square, which was obtained by adding together the four $SS_{\text{error}}$ for the
four groups and dividing this error total by 4 × 56 or 224. Where the
resulting F was significant the differences between groups are
significant. But this test has a rather narrow scope. A second F should be
computed, using the between groups mean square again as the numerator
and a pooled Ss within groups mean square as the denominator. (This
denominator can be set up by adding the four $SS_{\text{subjects}}$ for the four
groups and dividing by 4 × 29 or 116.) If this second F is not significant
the differences between groups may be accounted for in terms of reliable
differences between Ss. If the second F is significant, it means that the
differences between groups exceed the differences which might be
attributed to consistent variation between the individual Ss.
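
The arithmetic of the two F ratios may be sketched as follows. The Python function below is an illustrative addition; its name, its arguments, and the numbers supplied to it are hypothetical and chosen only to give round results.

```python
def between_groups_tests(group_totals, ss_error, ss_subjects,
                         n_subjects=30, n_lists=3):
    """Between-groups SS and the two F ratios described in the text.

    group_totals: one total per experimental group.
    ss_error, ss_subjects: one within-group sum of squares per group.
    """
    k = len(group_totals)
    per_group = n_subjects * n_lists                    # 90 scores per group
    grand = sum(group_totals)
    ss_between = (sum(t * t for t in group_totals) / per_group
                  - grand ** 2 / (k * per_group))
    ms_between = ss_between / (k - 1)                   # 3 df for four groups

    df_error = k * (n_subjects - 2) * (n_lists - 1)     # 4 x 56 = 224
    df_subjects = k * (n_subjects - 1)                  # 4 x 29 = 116
    ms_error = sum(ss_error) / df_error
    ms_subjects = sum(ss_subjects) / df_subjects
    return {
        "ss_between": ss_between,
        "df_error": df_error,
        "df_subjects": df_subjects,
        "F_against_error": ms_between / ms_error,
        "F_against_subjects": ms_between / ms_subjects,
    }


# Hypothetical totals and within-group sums of squares.
out = between_groups_tests([900, 990, 1080, 990],
                           ss_error=[560, 560, 560, 560],
                           ss_subjects=[870, 870, 870, 870])
print(out)
```

With these invented figures the first ratio is 6.0 and the second 2.0, a pattern in which the groups differ by more than the pooled error would suggest, while the difference is less striking once consistent variation between Ss is used as the yardstick.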

The special virtue of this design was that in addition to balancing 
out extraneous factors such as list difficulty, and order effects, these 
and individual differences (subject variance) were removed from the 
error estimate so that considerable precision was gained. This pro- 
cedure can often be followed, and when complete sets of permutations 
are used, all interactions within such sets of squares cancel out so that 
they are not confounded with the main effects. 


REPLICATED SQUARES 


When it is impossible to use complete sets of permutations in a 
design similar to that described above there remain two options. On
the one hand several different randomly selected latin squares might be
used, or on the other hand, the same square might be used over several
times. Each procedure has its merits. Using several different squares
will almost ensure that the interaction effects of each square will be
cancelled out by those of the others. Using one square several times 
enables the experimenter to obtain an error estimate containing only 
certain interactions, and the variables may be such that the experi- 
menter may have grounds to suppose that these interactions are of no 
consequence. 

An example of the repeated use of the same latin square appears 
in a recently completed experiment of Corrigan and Brogden (2). The 
dependent variable was the precision of linear horizontal pursuit move- 
ments which was studied as a function of the independent variable, the 
angle of the path from the line normal to the body of the S. Seven angles 
were used, and the Ss were given tests at these seven angles in seven 
sequences such that each angle occurred once and only once in each 
ordinal position of the sequence. A 7 × 7 latin square was thus formed
with Ss or sequences for rows, ordinal position within the sequence for
columns and angle of pursuit for latin letters. Instead of having one S
for each sequence, however, four Ss were used for each row, making 28 
in all. This resulted in four replications of the same square. 

The analysis of this experiment is slightly more complicated than
that of a simple latin square. The total scores for the four Ss in each
row may first be considered to form a single latin square with entries $T_{ij}$,
where the subscripts refer to rows and columns. Let $T_A, T_B, \ldots, T_G$
refer to total scores for the seven angles (letters); $T_I, T_{II}, \ldots, T_{VII}$
refer to totals for ordinal position within sequences (columns);
$T_a, T_b, \ldots, T_g$ refer to totals for the seven sequences (rows); and
$T_1, T_2, \ldots, T_{28}$ refer to total scores for the 28 Ss. The sum of all
squared scores is $\sum X^2$, and the correction factor is $C = \tfrac{1}{196}(\sum X)^2$. The
appropriate analysis is given in Table II.

The first three rows of Table II yield sums of squares which are
familiar. The fourth row gives a term which would be the error estimate
for a single 7 × 7 square. Since another error estimate is available,
however, the usual error mean square, here called Square Uniqueness, can
be evaluated to see whether or not it has been inflated or deflated by the
unique pattern of confounding which occurred in the interactions of
this particular square. If significant inflation or deflation occurs the
experimenter is thus made aware of it, although he is still unable to
discover which interaction or interactions are causing the difficulty.
Presumably if the $SS_{\text{square uniqueness}}$ is inflated or deflated the reverse
influence will be present in one or more of the main effects. If the Square
Uniqueness mean square is not significantly greater than the Error
mean square, the two sums of squares may be combined to obtain a
more reliable error estimate. The significance of the fifth mean square,
that for Individuals within Sequence, provides a test of the reliability of
the scores used (6, p. 249 f.). The Sequences mean square should be
tested against the Individuals within Sequences mean square as well as
against the Error term in order to learn whether a significant F for
Sequences mean square divided by Error mean square could be
accounted for in terms of significant variation between Ss.⁷ The error
term itself is free of all variation directly due to Ss, angles, ordinal
position, sequences, and square uniqueness. It may contain some
interaction variance, but this would be restricted to interactions between
Individual Ss within Sequences and Angles and Ordinal Position within
Sequences. Thus by replicating the same square a number of times, a
more nearly pure error estimate is obtained.
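
The partition just described can be sketched in Python. This example is an illustrative addition and makes several assumptions: a cyclic 7 × 7 square stands in for the square actually used, the scores are simulated, and the degrees of freedom are those implied by the description in the text (6, 6, 6, 30, 21, and 126, out of a total of 195).

```python
import random

N, REPS = 7, 4   # seven angles, positions, and sequences; four Ss per sequence

# Assumption: a cyclic square stands in for the square actually used.
SQUARE = [[(i + j) % N for j in range(N)] for i in range(N)]


def simulate(seed=0):
    """Hypothetical scores: scores[s][j] for S number s at ordinal position j."""
    rng = random.Random(seed)
    scores = []
    for s in range(N * REPS):
        i = s // REPS                                   # sequence followed by this S
        ability = rng.gauss(0.0, 2.0)
        scores.append([10.0 + ability + 0.5 * SQUARE[i][j] - 0.2 * j
                       + rng.gauss(0.0, 1.0) for j in range(N)])
    return scores


def replicated_square_anova(scores):
    """Ss 0-3 follow sequence 0, Ss 4-7 follow sequence 1, and so on."""
    c = sum(map(sum, scores)) ** 2 / (N * N * REPS)     # correction factor
    ss_total = sum(x * x for row in scores for x in row) - c

    angle_t, pos_t, seq_t = [0.0] * N, [0.0] * N, [0.0] * N
    cell_t = [[0.0] * N for _ in range(N)]              # T_ij over the four Ss
    for s, row in enumerate(scores):
        i = s // REPS
        for j, x in enumerate(row):
            angle_t[SQUARE[i][j]] += x
            pos_t[j] += x
            seq_t[i] += x
            cell_t[i][j] += x

    def ss_from(totals, scores_per_total):
        return sum(t * t for t in totals) / scores_per_total - c

    ss = {
        "angles": ss_from(angle_t, N * REPS),
        "ordinal position": ss_from(pos_t, N * REPS),
        "sequences": ss_from(seq_t, N * REPS),
    }
    ss_cells = ss_from([t for r in cell_t for t in r], REPS)
    ss_subjects = ss_from([sum(row) for row in scores], N)
    ss["square uniqueness"] = ss_cells - sum(ss.values())
    ss["individuals within sequences"] = ss_subjects - ss["sequences"]
    ss["error"] = ss_total - ss_cells - ss["individuals within sequences"]
    ss["total"] = ss_total
    df = {"angles": N - 1, "ordinal position": N - 1, "sequences": N - 1,
          "square uniqueness": (N - 1) * (N - 2),
          "individuals within sequences": N * (REPS - 1),
          "error": N * (REPS - 1) * (N - 1),
          "total": N * N * REPS - 1}
    return ss, df


ss, df = replicated_square_anova(simulate())
for source in df:
    print(f"{source:<30} SS = {ss[source]:9.2f}   df = {df[source]}")
```

The Square Uniqueness term is what would have served as error had only the 49 cell totals been available; the replication supplies the separate error term against which it can be tested.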


GRECO-LATIN SQUARES 


An N × N greco-latin square consists of N latin letters and N greek
letters in each of N rows, forming an N × N square in which each greek
and latin letter occurs just once in each row and each column, and each
greek letter occurs just once with each latin letter. The construction of
such squares is described by Fisher and Yates (5). An example of a
5 × 5 greco-latin square is given below:

                 Dα   Aγ   Cε   Bδ   Eβ
                 Aε   Cβ   Eδ   Dγ   Bα
                 Eγ   Bε   Dβ   Cα   Aδ
                 Cδ   Eα   Bγ   Aβ   Dε
                 Bβ   Dδ   Aα   Eε   Cγ

In the greco-latin square, rows, columns, latin letters, and greek letters 
are orthogonal to each other. The analysis of the 5X5 greco-latin 
square is outlined below: 





Source of Variation df
Rows 4 
Columns 4 
Latin letters 4 
Greek letters 4 
Error 8 
Total 24 
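
The two requirements of a greco-latin square, that each alphabet alone forms a latin square and that every latin-greek pair occurs exactly once, can be checked directly. The Python sketch below is an illustrative addition; the function names are assumptions, and the two squares are the latin and greek parts of the 5 × 5 example, entered row by row.

```python
from itertools import product

# Latin and greek parts of the 5 x 5 example, one string per row.
LATIN = ["DACBE", "ACEDB", "EBDCA", "CEBAD", "BDAEC"]
GREEK = ["αγεδβ", "εβδγα", "γεβαδ", "δαγβε", "βδαεγ"]


def is_latin(square):
    """Each symbol occurs exactly once in every row and every column."""
    n = len(square)
    symbols = set(square[0])
    if len(symbols) != n or any(len(row) != n for row in square):
        return False
    rows_ok = all(set(row) == symbols for row in square)
    cols_ok = all({row[j] for row in square} == symbols for j in range(n))
    return rows_ok and cols_ok


def is_greco_latin(latin, greek):
    """Both squares are latin and every (latin, greek) pair occurs once."""
    n = len(latin)
    if len(greek) != n or not (is_latin(latin) and is_latin(greek)):
        return False
    pairs = {(latin[i][j], greek[i][j]) for i, j in product(range(n), repeat=2)}
    return len(pairs) == n * n


print(is_greco_latin(LATIN, GREEK))  # True
```

When the check passes, rows, columns, latin letters, and greek letters each take N - 1 degrees of freedom, leaving (N - 1)(N - 3) for error, or 8 when N is 5.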


7 This test was found to be important in the Corrigan-Brogden experiment.

The greco-latin square is admirably suited to experiments such as
that of Buxton and Ross⁸ (1) in which four lists of nonsense syllables
were memorized under four experimental conditions by each of 32 Ss.
In the typical experiment of this type the lists will differ in difficulty,
the Ss will show a practice effect, and there will be consistent differences
between Ss in memorizing ability. Effects of all of these irrelevant
variables should be eliminated from the error estimate in an efficient
experimental design if the influence of the four experimental conditions 
is to be properly evaluated. 

Simultaneous elimination of all of these extraneous sources of varia- 
tion may be accomplished by using eight different 4X4 greco-latin 
squares, which could be arranged in a column to form a 32 row by 4 
column table. Then rows would yield individual differences between 
Ss; columns would be ordinal position in the sequence of procedures, 
yielding practice effects; latin letters would be assigned to experimental 
conditions; and greek letters would be assigned to lists. The analysis of 
variance is outlined below: 


Source of Variation df 
(1) Experimental condition (latin letters) 3 
(2) Variation in list difficulty (greek letters) 3 
(3) Order or practice effect (columns) 3 
(4) Individual differences between Ss (rows) 31 
(5) Error 87 
(6) Total 127 


The sums of squares are obtained as before. The Individual Differences
sum of squares will be equal to one-fourth of the sum of the 32 squared
individual totals minus the correction factor. The $SS_{\text{Error}}$ will be the
$SS_{\text{Total}}$ minus all other sums of squares.

This design is unusually efficient in that individual differences are 
effectively removed from the error estimate. Because individual differ- 
ences are orthogonal to the variation produced by the experimental 
factor, each S serves as his own control through all four procedures. 

One last experiment will be outlined to illustrate how the principles
of the factorial experiment may be combined with principles of the
greco-latin square. In this well-planned investigation (13) three
experimenters ran nine monkeys through a nine-week experimental program
under three conditions of motivation. Each week each monkey was
trained on a different list of essentially similar problems. By use of a
9 × 9 greco-latin square it was possible to eliminate the effects of
experimenters, monkeys, motivating conditions, problem lists, and
weeks. The experimental design is presented diagrammatically in
Table III. In this design the successive weeks of experimentation are
assigned to rows, the motivational conditions and experimenters are
assigned to columns, the monkeys are assigned according to the latin
letters, and the lists of problems are assigned according to the greek
letters.

8 Buxton and Ross obtained an effective double-elimination of individual variability
by means of analysis of covariance. The greco-latin square procedure described here is
an alternative design which they did not actually use.


TABLE III

USE OF 9 × 9 GRECO-LATIN SQUARE IN EXPERIMENT ON ROLE OF
MOTIVATION IN DISCRIMINATION REVERSAL LEARNING

Motivation                I                     II                    III
Experimenter       Don   June  Oscar     Don   June  Oscar     Don   June  Oscar     Row Totals

Week 1             Cζ    Aδ    Hι        Dγ    Bε    Fβ        Gθ    Eα    Iη        $T_1$
     2             Bη    Cθ    Gα        Fδ    Aι    Eζ        Iγ    Dε    Hβ        $T_2$
     3             Gι    Hη    Fγ        Bζ    Iθ    Aε        Eβ    Cδ    Dα        $T_3$
     4             Iα    Gβ    Eδ        Aη    Hγ    Cι        Dζ    Bθ    Fε        $T_4$
     5             Hε    Iζ    Dθ        Cβ    Gδ    Bα        Fη    Aγ    Eι        $T_5$
     6             Dδ    Eε    Cη        Hα    Fζ    Gγ        Bι    Iβ    Aθ        $T_6$
     7             Eγ    Fα    Aζ        Iι    Dβ    Hθ        Cε    Gη    Bδ        $T_7$
     8             Fθ    Dι    Bβ        Gε    Eη    Iδ        Aα    Hζ    Cγ        $T_8$
     9             Aβ    Bγ    Iε        Eθ    Cα    Dη        Hδ    Fι    Gζ        $T_9$

Col. Totals        $T_a$ $T_b$ $T_c$     $T_d$ $T_e$ $T_f$     $T_g$ $T_h$ $T_i$     $\sum X$

Monkeys designated by latin letters.
Problem lists designated by greek letters.


The columns are used in an interesting manner. Each of the three
experimenters tests under each of the three motivational conditions so
that the two factors, motivation and experimenter, are completely
orthogonal to each other, forming a three by three factorial experiment
within the larger structure of the experiment. This means that the
motivation × experimenter interaction can be evaluated. The greco-latin
square enables the experimenter to control the variation due to the
extraneous factors of: (a) individual differences between monkeys; (b)
variability in difficulty of the nine lists of problems; and (c) general
practice effects which might cause week to week improvement in the
performance of the animals.

9 It should be noted that if Meyer had tried to use a 6 × 6 square, he could not have
formed a greco-latin square, since none exists for that size. The same was long
conjectured for the 10 × 10, 14 × 14, 18 × 18 squares, etc., but greco-latin squares of
those sizes have since been shown to exist. For this reason, when several factors are to
be controlled simultaneously in a greco-latin or hyper-greco-latin design, the 6 × 6
square should be avoided.

The analysis of variance with the procedures for calculating the 
sums of squares of the data of the experiment is outlined in Table IV. 
The eight df for columns is split up, two for experimenters, two for 
motivational conditions, and four for the interaction of these two 
factors. The design is highly efficient and should give a fairly pure error 
estimate. The purity of the error term illustrates a good feature of the 
larger randomly selected latin squares. The larger latin squares are 
rarely used in agricultural field designs because as they become large, 
the increased efficiency does not compensate for their high cost. For 
psychological work, however, the larger latin squares are just as efficient 
as the small ones, and, in addition, will usually yield purer variance 
estimates. The use of factorial design in the columns, as in the example 
above, or the rows, or in both, is also more convenient in the larger 
squares. 
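
The division of the eight column degrees of freedom can be sketched as follows. This Python example is an illustrative addition; the function name and the nine column totals are hypothetical, and the totals are arranged with one row per motivational condition and one entry per experimenter.

```python
def split_columns(col_totals, n_rows=9):
    """Partition the column SS of the 9 x 9 square into its factorial parts.

    col_totals[m][e] is the column total for motivation m under experimenter e;
    each column total is the sum of n_rows scores (one per week).
    """
    n_m, n_e = len(col_totals), len(col_totals[0])
    n_scores = n_m * n_e * n_rows
    c = sum(map(sum, col_totals)) ** 2 / n_scores        # correction factor

    ss_columns = sum(t * t for row in col_totals for t in row) / n_rows - c
    motivation_totals = [sum(row) for row in col_totals]
    experimenter_totals = [sum(row[e] for row in col_totals) for e in range(n_e)]
    ss_motivation = sum(t * t for t in motivation_totals) / (n_e * n_rows) - c
    ss_experimenter = sum(t * t for t in experimenter_totals) / (n_m * n_rows) - c
    ss_interaction = ss_columns - ss_motivation - ss_experimenter

    ss = {"columns": ss_columns, "motivation": ss_motivation,
          "experimenters": ss_experimenter, "interaction": ss_interaction}
    df = {"columns": n_m * n_e - 1, "motivation": n_m - 1,
          "experimenters": n_e - 1, "interaction": (n_m - 1) * (n_e - 1)}
    return ss, df


# Hypothetical column totals (rows: motivation I-III; entries: three experimenters).
TOTALS = [[90, 99, 108],
          [99, 108, 117],
          [108, 117, 126]]
ss, df = split_columns(TOTALS)
print(ss)
print(df)
```

The invented totals are strictly additive, so the interaction sum of squares comes out as zero and the column sum of squares of 108 divides evenly between the two factors; real data would ordinarily leave a nonzero remainder for the four interaction degrees of freedom.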


CONCLUDING REMARKS 


This brief paper was not intended to exhaust the variety of fruitful 
applications of the latin square to psychological research. It is clear 
that whenever an experimenter is faced with the problem of balancing
out effects of the temporal order in which two or more procedures are 
to be followed with the same Ss he should consider the use of the latin 
square principle. For further information the psychologist will find 
good general descriptions of the latin square in these references (3, 4, 
7, 14, 16, 17), but specific psychological applications have as yet ap- 
peared only rarely (6, 15). It seems reasonable, however, to predict that 
the latin square design will prove just as valuable to the psychologist 
as the factorial experiment, and that further psychological applications 
of the latin square will soon be forthcoming.


KINSEY'S "SEXUAL BEHAVIOR IN THE HUMAN MALE":
SOME COMMENTS AND CRITICISMS¹


LEWIS M. TERMAN 
Stanford University 


Like others, the reviewer has been deeply impressed by the magni- 
tude and potential significance of Kinsey’s research. From the first 
volume of his projected series it is obvious that no one has ever obtained 
so much information from so many persons regarding the most secret 
phases of their sexual histories. The advance publicity given the book 
had prepared for it a hearty welcome, and a cursory examination of its 
contents tended to confirm the favorable opinions previous reviewers 
had expressed. However, a careful reading and rereading of the report 
has raised so many questions that it has seemed desirable to publish 
the following comments in the hope that they may (1) lead others to 
examine the book more critically, and (2) have a beneficial effect upon 
the treatment and exposition of data in the volumes which are to follow. 

It would be premature to attempt a general appraisal of Kinsey’s 
entire investigation on the basis of this progress report. The comments 
here offered are not an all-round appraisal of even the first volume, since 
they have dealt almost entirely with its shortcomings and inadequacies. 
The reviewer has felt justified in confining his comments chiefly to the 
demerits of the report because its merits have been recounted so ex- 
tensively by others.² 

Scope and validity of the basic data. All of the data were obtained 
through personal interviews, as Kinsey has no confidence in the anony- 
mous questionnaire. It is probably true that some of the 300 to 500 
items of information called for in his interviews could not have been ob- 
tained by any kind of questionnaire with comparable accuracy. On the 
other hand, it is conceivable that some of the information would have 
been more accurately reported had a method been used which pre- 
vented the investigator from learning the identity of any of his respond- 
ents. Be that as it may, Kinsey has chosen the slower and the harder 


1 Kinsey, A. C., Pomeroy, W. B., and Martin, C. E. Sexual behavior in the human 
male, Saunders, 1948. Pp. 804. 

(As Kinsey is responsible for the general plan of the investigation, gathered most of 
the data, and presumably wrote the report, the reviewer has omitted the names of 
Pomeroy and Martin in references to authorship.) 

2 See especially: Sex habits of American men, A symposium on the Kinsey Report, 
edited by Albert Deutsch, Prentice Hall, 1948; American sexual behavior and the Kinsey 
Report, by Morris Ernst and David Loth, Greystone Press, 1948; and About the Kinsey 
Report, edited by D. Geddes and C. Curie, New American Library, 1948. 




way. One can only marvel at the zeal and perseverance of an investiga- 
tor who would undertake to carry through 100,000 interviews, each re- 
quiring on the average 90 minutes or more. 

The amount of information obtained is surprisingly great for a single 
interview, covering, as it does, not only details about current sexual ac- 
tivities, but in equal detail the earlier activities of the subject as far back 
as memory can recall them. Unfortunately, the author tells us almost 
nothing about the wording of the questions asked, a matter which the 
professional pollsters have found to be extremely important. The rea- 
son given for this omission is lack of space, but since the wording of ques- 
tions vitally affects the interpretation of almost every statistic in this 
800-page book, the omission is regrettable. 

What the author does say about the questions is not always reassur- 
ing. In the first place, we are told that they have never been stand- 
ardized; instead, the manner of wording them varies according to the 
age, intelligence, and personality of the subject being interviewed. The 
necessity of alternative forms of wording will be granted, but without 
knowledge of the forms deemed permissible, no other investigator can 
repeat the Kinsey experiment with any assurance that he is getting 
comparable results. The two assistants (Pomeroy and Martin) had un- 
dergone a year of training in interview methods, yet each obtained re- 
sults which at various points differ reliably from those of the other and 
from Kinsey’s. This is true despite the fact that sex, race, marital status, 
age, education, religion, and rural-urban residence were held constant 
for the three groups of interviewees compared. 

Consider, for example, the mean frequencies found by the three inter- 
viewers in the age groups adolescence to 15, and 16-20. The means given 
(p. 134) are for total outlet, masturbation, nocturnal emission, premari- 
tal coitus, and homosexual contact. This gives ten comparisons of 
means in the two age groups for the total population and ten additional 
comparisons for the active population (that is, the population which has 
practiced a given sexual activity). In nine of the ten comparisons for 
the total population the Kinsey mean is highest, and in most cases re- 
liably so. For premarital coitus in the age group 11-15, Kinsey’s figure 
is three times that of Pomeroy and more than twice that of Martin. 
For homosexual contact at this age Martin’s figure is less than one- 
fourth as high as either Kinsey’s or Pomeroy’s. In age-group 16-20 
Kinsey’s figure for premarital coitus is nearly twice that of Pomeroy, 
and for homosexual contact it is four times that of Martin. Differences 
of similar magnitude are found in the ten sets of mean frequencies for 
the “active population.”³ 

3 The generally higher frequencies found by Kinsey are said to be due to the fact that 
some of the more promiscuous and difficult subjects were assigned to him for interview. 
However, we learn from Table 22 and Table 23 that there are reliable differences between 
Kinsey's own data of 1938-1942 and his data of 1943-1946, 

A questionable feature of the interview technique is the author’s 
practice of always placing the burden of denial upon the subject. The 
practice is defensible with a majority of subjects, but it could easily in- 
validate the reports of children and of feeble-minded or low-level adults. 
“We always assume,” says Kinsey, “that everyone has engaged in every 
type of activity. Consequently we always begin by asking when they 
first engaged in such activity” (p. 53). Yet, in the second paragraph 
preceding this statement the author warns that “In his tone of voice 
and in his choice of words the interviewer must avoid giving the subject 
any clue as to the answers he expects.” On p. 55 the author describes a 
technique for what he calls “proving the answer,” which is said to be 
useful with uneducated persons and the feeble-minded. The method is 
“to pretend that one has misunderstood the negative replies and ask 
additional questions, just as though the original answers were affirma- 
tives....” Example. “Yes, I know you have never done that, but 
how old were you the first time that you did it?” Anyone familiar with 
the experimental literature on suggestibility will wonder about the pos- 
sible effects of this technique. 

The author is little concerned about the danger of fabrication; that, 
he believes, can be taken care of by “Looking an individual squarely in 
the eye, and firing questions at him with maximum speed...” (p. 
54). Cover-up, the author says, is harder to catch, and its possible in- 
fluence on the incidence figures obtained is specifically mentioned in a 
number of places. For example, on p. 499 he says that “while college 
men more often admit their experience [of masturbation], there are 
males in some other groups who would admit almost any other kind of 
sexual activity before they would give a record of masturbatory experi- 
ence.” Elsewhere he states that histories on socially taboo items gen- 
erally are difficult to secure from older married subjects of superior 
levels. On p. 54 it is said that cover-up is combatted by “the use of a 
considerable list of interlocking questions which provide cross-checks 
throughout the history, and particularly in regard to socially taboo 
items.” However, the author is not very explicit about the exact nature 
of these cross-checks, and the examples given do not impress this re- 
viewer as altogether convincing. 

An additional source of error is the long-distance memory report. 
The author tries to check on the extent of such errors by retakes on 162 
subjects and by noting the extent of husband-wife agreement in 231 
married couples who gave memory reports on the same facts. The re- 
takes were made after intervals ranging from 18 months to 7 years 
(mean, 38.5 months). Table 13 shows the take-retake correlations to 
be very high for vital statistics data and for percentages who had en- 
gaged in specified sexual activities. On frequencies of outlet, the corre- 
lations were much lower, ranging from .58 to .67, and in most cases they 
were similarly low on reported age when given sexual activities first oc- 
curred. For example, on age at first ejaculation the two reports dis- 
agreed by more than one year for 52.4 per cent of subjects (r = .58). On 
reported age at first nocturnal emission 76.2 per cent of subjects showed 
disagreement of more than a year (r = .54). In general, the correlations 
on frequencies and on age at first experience are not high enough to per- 
mit very reliable comparisons with other variables. Moreover, as the 
author admits (p. 125), the retakes do not test the validity of the data, 
but rather the constancy of memory and of tendency to cover-up. 

Regarding the comparisons of reports by the 231 married couples, 
Kinsey says “the record shows an amazing agreement” between the 
statements of husbands and wives. However, examination of Table 14 
shows that most of this “amazing agreement” is on 12 items of vital sta- 
tistics, such as number of years married, length of premarital acquaint- 
ance, length of engagement, age at marriage, number of children, 
amount of education, occupation of father, etc. The husband-wife cor- 
relation was only .50 for average frequency of coitus in early marriage, 
.54 for maximum frequency, and .60 for average frequency “now.” Of 
the 14 correlations which are .80 or higher (out of the total of 32), 10 
concerned vital statistics and the others concerned the practice of coital 
foreplay and intercourse in the nude. 

Unable to correct for errors of memory, the author proceeds sta- 
tistically as though they did not exist; that is, he gives the same weight 
to reports based upon remote recall as he gives to reports of current ac- 
tivities. In the computation of mean frequency of masturbation at age 
15, for example, the memory report of a 50-year-old counts as heavily as 
the report of a 15-year-old. 

The validity of Kinsey's sampling. The problems of sampling in an 
investigation of this kind are of paramount importance. If every re- 
sponse by every subject were completely truthful, the resulting data 
could still be misleading if the groups interviewed were not representa- 
tive of their kind. Indeed, representativeness is incomparably more im- 
portant than sheer numbers, for however numerous the subjects inter- 
viewed, if the sampling is biased the generalizations will be biased. An 
advertising circular from the publisher states that the interviews were 
“conducted with full regard for the latest refinements in public opinion 
polling methods.” However, Kinsey’s discussion of his sampling pro- 
cedure on p. 92 ff. makes it clear that no scheme of randomization like 
those common in public opinion polls has been used in this study. He 
depended instead on hundred-per-cent samples of certain groups and 
upon diversification of the total population by the addition of whatever 
subjects were available for interviewing. The latter were subjects who 
volunteered or could be persuaded to cooperate, and are said (p. 95) to 
constitute 74 per cent of the 12,000 males and females who have been 
interviewed to date. 

Kinsey is not to be criticized for not using the methods common in 
public opinion polls; as he points out, a strictly random selection of sub- 
jects in a study of sexual behavior would not have been feasible. The 
report is open to criticism, however, for not giving us the information 
needed to judge the representativeness of either the volunteers or the 
hundred-per-cent samples. The N’s of contributing groups are almost 
never stated. Hundred-per-cent samples were obtained from 62 groups, 
of which 42 were of college level. Seven of the remaining 20 groups were 
institutional cases in four delinquent groups, two penal groups, and one 
group in a “mental” institution. There were three classes of junior high 
school students, three speech-clinic groups, three rooming-house groups, 
two groups of conscientious objectors, a group of N.Y.A. workers, and 
a group of hitch-hikers. Whatever the N’s of these individual groups 
may have been, it is unlikely that the total hundred-per-cent sample 
could have been representative of the U. S. population, or could have 
been made so by any kind of statistical doctoring. 

The information given about the volunteers is equally incomplete. 
We are told (p. 38) that about half of the 12,000 histories to date have 
been obtained through contacts resulting from several hundred lectures 
by the author to groups numbering in all perhaps 50,000 persons. We 
do not know how those who attended the lectures differed from those 
who might have attended but did not, nor how the 6,000 who heard the 
lectures and allowed themselves to be interviewed differed from the 
44,000 who heard them but did not cooperate. The author lists (p. 39) 
32 groups of “contact” persons, numbering “many hundred” in all, who 
helped in obtaining volunteers. Seven of these 32 were delinquent 
groups: male prostitutes, female prostitutes, bootleggers, gamblers, 
pimps, prison inmates, thieves and hold-up men. These, presumably, 
would have brought in others of their kind, but in what numbers they 
did so we are not told. Elsewhere (p. 15) the author lists a dozen prison 
populations which, he says, “have augmented our understanding of 
economically and educationally lower social levels, and of the broken 
marriages which are in the histories of a high proportion of the penal 
inmates.” Additional institutions mentioned which were not penal or 
correctional included a state school for feeble-minded, two children’s 
homes, and two homes for unmarried mothers. On p. 16 we learn that 
subjects were obtained from “homosexual communities” in Chicago, 
New York, Philadelphia, Indianapolis, and St. Louis; also from “under- 
world communities” in Chicago, Peoria, Indianapolis, New York City, 
and Gary (Indiana). 

On p. 392 Kinsey states that he has data on more than 1,200 persons 
who have been convicted of sex offenses. We are not told how many of 
the convicted sex offenders are included in the population of 5,300 white 
males for whom data are summarized in this volume. On p. 210 we 
learn that data on frequencies of penal groups while in prison were not 
included in the frequency calculations, but their memory reports of 
sexual activities prior to their imprisonment were presumably included 
along with the data from other subjects. The scanty and scattered in- 
formation available warrants the suspicion that Kinsey’s educationally 
low-level groups may have been far from typical of this level in the gen- 
erality. The suspicion is strengthened by the information (p. 213) that 
49.4 per cent of the underworld males have a mean frequency of outlet 
that is equalled by only 7.6 per cent “of any population.” 

Similar questions arise about the representativeness of nearly all the 
sub-groups, except possibly the group which had attended college one or 
more years, and which, incidentally, made up more than half of his 
5,300 white males. Little information is given as to the source of the 
9-12 educational group, and even less about the source of the rural 
group. The rural group is vaguely defined (p. 451) as including those 
subjects who spent any “appreciable portion of the years between 12 
and 18” on an operating farm. Specific mention is made of interviews 
carried out over a period of years in certain “remote” and “isolated” 
rural communities, but what they were like, or how many subjects they 
furnished, we are not told. It is unfortunate that the author did not ad- 
here to the excellent rule which he laid down on p. 33 to the effect that 
“Each segment which is studied must be precisely delimited, and all 
conclusions must be confined to such precisely defined groups.” 

Other fragments of information that have a bearing on the sampling 
are found throughout the book. On p. 544 mention is made “of the six 
thousand marital histories in the present study, and of nearly three 
thousand divorce histories ....” These, presumably, are in the twelve 
thousand histories collected to date from males and females. Surely a 
population in which the divorce histories are half as numerous as the 
marital histories provides a shaky foundation for a census of sexual be- 
haviors. 

One question regarding the representativeness of Kinsey’s sampling 
is whether the subjects who volunteered, and who account for about 
three-fourths of his total population, tended to be of a special sort. One 
might suppose that persons most willing to talk about their sex lives 
would be, in a disproportionate number of cases, those least inhibited in 
their sexual activities. On p. 37 Kinsey says that many who volunteered 
did so because they were seeking information or help in connection with 
their personal problems. The best way to check on the representative- 
ness of the volunteer sample would have been to compare it directly 
with the hundred-per-cent sample. Kinsey does not do this, but he does 
(on pp. 94-102) compare the hundred-per-cent sample with what he 
strangely calls the “partial sample,” the partial sample being defined in 
a footnote as the hundred-per-centers plus the volunteers. That is, his 
comparisons are really between the hundred-per-cent sample and his 
complete sample. 

The results of these comparisons are summarized in Table 3, which 
gives for single males of college level the mean and median frequencies 
of six types of sexual outlet for three separate age groups: adolescence to 
15, 16-20, and 21-25. The figures show that at all age levels, and for all 
kinds of outlet except nocturnal emissions, the means run consistently 
higher for the complete sample. Many of the differences are very large 
and highly reliable despite the fact that the complete sample includes 
also his hundred-per-cent sample. From the data given in the table Dr. 
Quinn McNemar has computed the means separately for the subjects 
who volunteered. When these are compared with the means of the 
hundred-per-cent sample the differences are of course greater than those 
shown in the table. For premarital coitus the difference in mean fre- 
quency by this method of comparison is nearly 2 to 1 at adolescence to 
15 (actual means .09 and .05), somewhat less than 2 to 1 at ages 16-20 
(actual means .31 and .18), and about 3 to 2 at ages 21-25 (actual means 
.50 and .36). For homosexual contacts the difference as thus computed 
becomes nearly 2 to 1 at adolescence to 15, more than 3 to 1 at 16-20, 
and exactly 4 to 1 at 21-25. Differences of such magnitude confirm the 
suspicion that willingness to volunteer is associated with greater than 
average sexual activity. And since the volunteers account for about 
three-fourths of the 5,300 males reported upon in this volume, it follows 
that Kinsey’s figures, in all probability, give an exaggerated notion of 
the amount of sexual activity in the general population.⁴ 
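
The separation of the volunteers' means from the published figures rests on simple weighted-mean arithmetic, which can be sketched as follows. This is only an illustration of the kind of computation involved, not a reproduction of Dr. McNemar's work: the N's and the pooled mean used below are invented, since the review does not state them, and the function name is hypothetical.

```python
def remaining_group_mean(n_all, mean_all, n_part, mean_part):
    """Mean of the cases in a pooled group that lie outside a known part.

    The pooled sum is n_all * mean_all; removing the known part's sum
    and dividing by the remaining number of cases gives the mean of
    the remainder.
    """
    n_rest = n_all - n_part
    if n_rest <= 0:
        raise ValueError("the known part must be smaller than the pooled group")
    return (n_all * mean_all - n_part * mean_part) / n_rest

# Invented figures: 1,000 pooled cases averaging .08, of which
# 250 hundred-per-cent cases average .05.
volunteer_mean = remaining_group_mean(1000, 0.08, 250, 0.05)
print(round(volunteer_mean, 3))
```

Because the pooled mean is a weighted average of its two parts, the remainder's mean always lies farther from the known part's mean than the pooled mean does, which is why the volunteer-versus-hundred-per-cent differences exceed those shown in the table.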

Notwithstanding the fact that all but a small fraction of the subjects 
interviewed resided in five mid-western states, five middle-Atlantic 
states, and two New England states, Kinsey “corrects” his findings (for 
such factors as education, age distribution, marital status, rural-urban 
residence, et cetera) to show incidences and frequencies for various sub- 
groups on each kind of sexual outlet in the entire U. S. population. It is 
evidently the intention of the author to spread his interviews in the 
years to come more or less equally throughout the country in proportion 
to the population of the individual states, but even when this is done 
there will still remain the problem of obtaining in each area a represen- 
tative sampling for each of his twelve major break-down groups. In view 
of the inadequacies of the sampling to date, the “corrections” to show 
hypothetical incidences and frequencies for the total U. S. population 
seem to this reviewer indefensible at the present stage of the investiga- 
tion. 

One of the most puzzling omissions in the book is the author’s failure 
to give the complete age distribution of his subjects at the time they 
were interviewed. Mention is made of subjects who were in their 70’s 
and 80’s, but because of small N’s in the upper age brackets the data for 


4 This conclusion is based on the comparisons between the volunteers and the 
hundred-per-centers of college level. The author does not compare the two kinds of 
samples for the 0-8 or 9-12 educational levels because of the relatively small N’s in these 
groups. 

most types of outlet are not summarized for age groups above 40 or 45 
years. The reviewer has found no statement about the lowest age limit 
of the younger subjects; incidences and frequencies are summarized for 
some types of outlet down to the age of 8 years, but we are not spe- 
cifically informed whether any data obtained from 8-year-olds (or for 
that matter from 9-, 10-, 11-, or 12-year-olds) have been reported for 
these 5,300 males. As previously stated, the author throws together the 
memory reports and the reports of current activities, giving the two 
kinds of data equal weight. It would have been helpful if he had shown in 
the tables what proportion of the N at any given age level was accounted 
for by subjects at or near that age. This proportion would be high in the 
late teens and early twenties, because of the large (though unstated) 
number of students who were interviewed while they were attending col- 
lege; but for all we know, the proportion may be zero for ages 8 to 12 or 
even later, that is, the data for the lower ages in the tables may all be 
based on the memory reports of older subjects, many of them 20, 30, or 
40 years older. 

The sample-size experiment. Kinsey gives the results of a series of 
drawings of random samples from his total population which were in- 
tended to establish the size of sample needed to give stable incidences 
and frequencies for the different kinds of sexual activity. The reviewer 
asked Dr. Quinn McNemar to check over the statistical procedures used 
in this experiment. His report follows. 


In all, 40 pages (82-92; 736-765) are devoted to determining, empirically, 
what size of sample for each sub-group is desirable and adequate. This empirical 
approach involved drawing sub-samples of varying sizes from available total 
samples (groups) and noting what N, in general, led to values for means, medi- 
ans, modes, modal frequencies, and ranges which were within five per cent of 
the magnitudes of the respective values for the total groups. For incidence per- 
centages, the criterion was that sub-sample percentages should fall within two 
per cent of the magnitudes of the respective total group percentages. This pro- 
cedure, which required a total of 4,279 comparisons of sub-samples with total 
sample values, involves some serious fallacies. 

(1) Failure to recognize the fact that the sampling stabilities of means, 
medians, and modes are not a function of their magnitudes, but rather of trait 
variability. Means, for example, of the same size can, and do, have differing 
standard errors for constant N when “score” variation differs. 

(2) Failure to consider the fact that these three statistics differ markedly 
from each other in their sampling errors even when computed from the same 
distribution or from distributions of similar variability. For a normally dis- 
tributed trait 57 per cent more cases are required to obtain a median with a 
standard error equal to that of the mean. In general, the shape of a distribution 
affects the relative sampling stability of the median and mean. 

(3) Failure to observe the fact that the sampling stability of percentages is 
not a linear function of their magnitudes but rather of their degree of remoteness 
from 50 per cent. For constant N, the standard error of 90 per cent is the same 
as the standard error of 10 per cent, whereas the criterion of 2 per cent of mag- 
nitude would, if correct, imply that the error for 90 per cent is 22 times (5% of 
90% divided by 2% of 10%) as large as the error for 10 per cent! 

(4) Failure to note that convergence of sub-sample values to total group 
values must be more rapid when sub-samples are drawn from small (finite) 
groups than when drawn from larger groups. For instance, one would expect a 
sub-sample of 400 drawn from a total group of 481 to yield a value very near the 
total value, and likewise for 300 from 375, but a sample of 300 or 400 drawn from 
1,513 or from 2,762 leaves room for considerable variation. The ignoring of this 
fact led to the puzzling conclusion (p. 85) that sample sizes greater than 400 
“fail to show any consistent improvement,” as one would expect “by standard 
statistical theory.” Now it happens that two-thirds or 440 of the 668 samples of 
size 400 are sub-samples which include from 66 per cent to 87 per cent of the 
total cases from which they were drawn, whereas none of the samples of size 600 
exceeds 60 per cent of the supply and two-thirds of the samples of 600 include 
less than 40 per cent of the supply. Under these circumstances and in light of 
that part of standard statistical theory concerned with sampling from finite 
universes, one would expect samples of 400 to appear more adequate than 
samples of 600. 

In brief, incognizance of four elementary statistical principles renders 
worthless this elaborate effort to determine how large N should be for a sub- 
group. 
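
The four principles named in the report can be checked numerically. The sketch below is an illustration only and is not part of Dr. McNemar's report: it assumes the usual large-sample formulas (standard error of a mean, the pi/2 variance ratio of median to mean for a normal trait, the binomial standard error of a proportion, and the finite population correction for drawing without replacement). The function names and the standard deviations in point (1) are invented; the group sizes in point (4) are those quoted in the report.

```python
import math

def se_mean(sd, n):
    """Standard error of a mean: depends on the spread (sd), not on the mean."""
    return sd / math.sqrt(n)

def median_extra_cases_normal():
    """Fraction of additional cases a median needs to match the mean's
    standard error, for a normal trait (large-sample ratio pi/2)."""
    return math.pi / 2 - 1

def se_proportion(p, n):
    """Standard error of a proportion p based on n cases."""
    return math.sqrt(p * (1 - p) / n)

def finite_correction(n, population):
    """Factor by which sampling error shrinks when n cases are drawn
    without replacement from a finite group of the given size."""
    return math.sqrt((population - n) / (population - 1))

# (1) Equal means, unequal variability (sd values are invented).
print(se_mean(1.0, 400), se_mean(3.0, 400))
# (2) Extra cases needed by the median, as a fraction.
print(round(median_extra_cases_normal(), 2))
# (3) 90 per cent and 10 per cent have the same standard error.
print(se_proportion(0.90, 400), se_proportion(0.10, 400))
# (4) Samples of 400 from groups of 481 and 2,762; 600 from 1,513.
for n, total in [(400, 481), (400, 2762), (600, 1513)]:
    print(n, total, round(finite_correction(n, total), 2))
```

Under these assumptions the correction factor for 400 cases drawn from 481 is far smaller than that for 600 drawn from 1,513, which agrees with the report's point that samples of 400 would appear more adequate than samples of 600 in Kinsey's comparisons.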


Effects of early and late puberty. Some interesting material is sum- 
marized on the relation of sexual activity to age at onset of adolescence. 
The criteria used to establish this age are described as follows on p. 299. 


In the present study, the time of onset of adolescence has been fixed as the 
date of the first ejaculation, unless there has been evidence that ejaculation would 
have been possible at an earlier age if the individual had been stimulated to the 
point of orgasm. When the year of first ejaculation coincides with the year in 
which the first pubic hair appears, and with the time of onset of rapid growth in 
height, and/or with certain other developments, there is no question that that 
year may be accepted as the first year of adolescence. Eighty-five per cent of all 
male histories fall into this category. On the other hand, if the first ejaculation 
follows these other events by a year or more, and if it is clear that there was no 
test of the individual's sexual capacity prior to the first ejaculation, and if there 
seems to be no question of the reliability of the memory in regard to the dates of the 
other adolescent developments, then the age of onset of adolescence is better 
established by events other than ejaculation. When first ejaculation occurs as 
a nocturnal emission, it usually (though not always) does not come until a year 
or more after the appearance of the other adolescent developments, and the 
onset of adolescence should be set a year or more before the first ejaculation. 
(Italics by reviewer.) 


Apart from the fact that the rules here set forth are involved and 
“iffy,” the reviewer doubts whether any man ten or twenty years be- 
yond adolescence could give more than a wild guess as to his age at first 
ejaculation, at first appearance of pubic hair, or at onset of rapid 
growth. It is not merely a matter of memory; each of these signs of 
adolescence makes its appearance gradually. Consider the “unless” 
clause in the first sentence. What would be good evidence that ejacula- 
tion could have occurred earlier if suitable stimulation had been pres- 
ent? And how could the interviewer be sure whether there was or was 
not a test of the subject’s sexual capacity prior to first ejaculation, or 
whether the subject’s memory in regard to the other signs of adolescence 
is or is not reliable? Such judgments would seem to call for a kind of 
occult insight that most people don’t have. 

It is surprising that memory reports of such events, erroneous as they 
often must be, should reveal significant differences between early- and 
late-maturing subjects for both frequencies and incidences. On p. 307 
it is stated that “The effect persists throughout the lives of the married 
males, as far as data are available.” However, the N’s are so low at the 
later years that for the educational levels 0-8 and 9-12 no data on this 
variable are given beyond age 25. By combining the three educational 
levels the author is able to carry the data, with some fairly satisfactory 
N’s, to age 40. It should be noted, however, that the mixing of educa- 
tional levels is a procedure which he severely criticizes in others. The 
figures on p. 306 show that in frequency of total outlet the ratio of early- 
to late-maturing married males is more than 2 to 1 at age 16-20 and 
about 3 to 2 at 21-25 and 26-30, but that thereafter the ratio fluctuates 
in the neighborhood of 1 to 1. Accordingly, the author’s generalization 
about the persistence of the difference throughout the married life of the 
subjects is hardly warranted by the data presented. 

The author suggests that both early puberty and continued high 
frequency of outlet are probably functions of the general metabolic 
level, although no metabolism tests of his subjects are reported. In line 
with this is his belief that the “early-adolescent males are more often 
the more alert, energetic, vivacious, spontaneous, physically active, 
socially extravert, and/or aggressive individuals in the population,” 
and that, conversely, the late-maturing are more often “slow, quiet, 
mild in manner, without force, reserved, timid, taciturn, introvert, and/or 
socially inept...” (pp. 325-326). He bases this conclusion on per- 
sonality ratings recorded at the time the interviews were made. Psy- 
chologists who have found it so difficult to devise reliable measures of 
such personality traits will be interested in these ratings. 

Another of Kinsey’s conclusions (p. 325) is that the sexual capacities 
do not seem to be impaired as a result of their early initiation and con- 
tinued high frequency in the early-maturing male. On p. 323 he reports 
for 69 older impotent males a plus correlation of .30 between age of on- 
set of adolescence and the age of onset of impotence, but he interprets 
this coefficient as indicating “that there is in actuality no significant 
correlation” and that impotence is as likely to occur by a given age 
among the early-maturing as among the late-maturing. However, the 
correlation is significant at the .01 level, by the small-sample tech- 
nique, which suggests a definite tendency for the earlier-maturing and 
high-frequency males to become impotent earlier. 
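
The significance check described above can be sketched as follows. This is only an illustration: the review does not say which "small-sample technique" was used, so the t test for a correlation coefficient (with n - 2 degrees of freedom) and the Fisher z approximation are assumptions, and the function names are hypothetical. With r = .30 and N = 69 the t statistic comes to about 2.57 on 67 degrees of freedom, which lies close to the .01 boundary, so the verdict depends on whether a one-tailed or a two-tailed criterion is applied.

```python
import math

def correlation_t(r, n):
    """t statistic for testing a Pearson r against zero (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

def fisher_z_p_value(r, n):
    """Approximate two-sided p-value using Fisher's z transformation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2.0))

# Figures quoted in the review: r = .30 for 69 older impotent males.
t_stat = correlation_t(0.30, 69)
p_two_sided = fisher_z_p_value(0.30, 69)
print(f"t = {t_stat:.2f} on 67 df; approximate two-sided p = {p_two_sided:.3f}")
```

Halving the two-sided value gives the one-tailed probability appropriate to a directional hypothesis.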

The stability of sexual patterns in two generations. The author gives 
extensive data on the stability of sexual patterns from one generation 
to another. He divided the entire male sample into two groups. One 
group included those who were 33 years of age or older at the time they 
contributed their histories; its median age was 43.1 years. The other 
group included all who were younger than 33, and its median age was 
22.2 years. The two groups have been compared on incidence and fre- 
quency of every type of outlet for each age from 8 to 33 or 34 years. The 
author’s conclusion from these comparisons, as stated on p. 397, is that 
“In general, the sexual patterns of the younger generation are so nearly 
identical with the sexual patterns of the older generation in regard to so 
many types of sexual activity that there seems to be no sound basis for 
the widespread opinion that the younger generation has become more 
active in its sociosexual contacts.” The critical reader will want to 
scrutinize the tables carefully to learn the extent to which the data jus- 
tify this conclusion. 

Note first that the subjects in the two groups do not all belong to 
separate generations. Their age distributions are adjacent, and many of 
each group must have differed in age only a few months to a few years 
from many in the other group. We can be sure, therefore, that whatever 
differences the figures yield would have been much greater if there had 
been an age gap of 10 or 20 years between the groups. Secondly, it should 
be noted that the reports by the older group are based on much more 
remote recall, on the average some 20 years more remote. 

With these limitations of the data in mind, we turn next to the tabu- 
lated reports of the subjects. At the educational level 13+ we find (p. 
400) that the incidence of premarital intercourse is considerably higher 
for the younger group from age 19 to 24 inclusive; for most of these ages 
it is one-fifth to one-seventh higher. At the 0-8 educational level the 
accumulative incidences for most ages run reliably higher in the younger 
group for all types of outlet except intercourse with prostitutes. At this 
level the incidence of premarital intercourse is four times as high for the 
younger group at age 12, three times as high at age 13, one-half higher 
at 14 and 15, and one-third higher at 16. At this educational level the 
incidence of petting at ages 12 and 13 is three times that reported by the 
older group, and at 14 it is nearly twice as high. In this 0-8 group the 
incidence of masturbation runs significantly higher in the younger group 
from 11 to 30 years; at 12 and 13 it is close to twice as high. In both the 
0-8 and the 9-12 educational levels the incidence of homosexual con- 
tacts runs from one-and-one-half times as high to nearly twice as high 
for the younger generation, and among married men of these two levels 
the incidence of extramarital intercourse is in all groups between one- 
third and one-half higher for the younger generation. 

The outlet frequencies (p. 410) show still greater differences between 
the two generations. For example, premarital intercourse in the 0-8 
group has a mean frequency about twice as high in the younger generation 
as in the older. In the 9-12 group the difference is less but is statistically 
significant. Homosexual frequencies at both the 0-8 and 9-12 
levels run consistently close to twice as high for the younger generation. 
Among married males of the 0-8 educational level, extra-marital inter- 
course occurs in the younger generation from two to five times as fre- 
quently as in the older generation. 

In view of the above differences the author’s assertions that “These 
comparisons of the sexual activities of older and younger generations 
provide striking evidence of the stability of the sexual mores” (p. 414), 
and that “There has not even been a material increase or decrease in the 
incidences and frequencies of most types of activity” (p. 415), are much 
too sweeping. 

The influence of occupational level. Kinsey presents considerable data 
on incidences and frequencies for subjects who have moved up or down 
from the parental occupational class. On p. 419 he summarizes this material 
as follows: “In general, it will be seen that the sexual history of the 
individual accords with the pattern of the social group into which he 
ultimately moves, rather than with the pattern of the social group to 
which the parent belongs. ...” Examination of the tables reveals that 
there are many exceptions to this rule. The data presented graphically 
on p. 444 are not typical of all ages, all occupational groups, or all inci- 
dences and frequencies. 

It is largely the material on vertical mobility that leads Kinsey to the 
generalization, frequently reiterated, that a subject’s life-long patterns 
of sexual behavior are well established before the age of 16. An examina- 
tion of the figures for the various kinds of activity at successive age levels 
supports this generalization only in part. There is some tendency for the 
subject who moves to a high occupational class from a lower parental 
class to show by the age of 15 the types of sexual behavior characteristic 
of the higher class, but the extent to which this is true varies with the 
different kinds of outlet; and even for a particular type of outlet the rule 
does not hold equally well for migrants from all parental classes. 

A factor which could invalidate the author’s data on this issue is the 
possibility that subjects in the various social classes may not report with 
equal accuracy their sexual activities during the early teens. That this 
factor may have entered is suggested by the mean frequencies of noc- 
turnal emissions (pp. 424-425). Why, for example, should subjects of 
parent-class 4 (skilled laborers) have only one-sixth as frequent noctur- 
nal emissions before age 16, if they stay in the parent class, as they have 
if they are destined to move up to the professional class? And why 
should the subjects of parent-class 3 (semiskilled labor) have them only 
one-third as frequently before 16, if they stay put, as they have if they 
will later enter a profession? Such differences in a type of sexual behav- 
ior that is non-volitional render suspect all of Kinsey’s data on outlet 
as related to occupational mobility. 

The influence of religion on sexual behavior. In several passages Kinsey 
seems to regard the sex drives as forces of nature that will find their 
outlet regardless of measures taken to curb them. On p. 269 he says: 
“... it is clear that there is a sexual drive which cannot be set aside for 
any large portion of the population by any sort of social convention.” 
In this connection the data presented on the sexual activity of active 
and inactive Protestants, of devout and inactive Catholics, and of 
orthodox and inactive Jews are interesting. The value of these comparisons 
is somewhat limited by the small N’s for Catholics and Jews, 
and by the author’s failure to describe adequately his method of classi- 
fying a subject as religiously active or inactive, but in spite of these limi- 
tations the data suggest that religious attitudes have a considerable 
influence on most types of sexual activity. In almost every comparison 
the religiously active groups have lower incidences and frequencies than 
the religiously inactive groups of the same denomination. Unless reli- 
gion merely attracts persons of low sex drive (which is doubtful), it 
would seem that religious attitudes exert a definitely restraining 
influence. 

Generalizing beyond the data. As previously noted, Kinsey not infre- 
quently makes unqualified statements which go beyond his data. 
These are of two kinds: (1) broad generalizations based upon small N’s, 
or upon groups which for other reasons are doubtfully representative; 
and (2) generalizations which are contradicted by the data given. Six 
additional examples are here brought together. 

1. On p. 567 Kinsey asserts, in bold type, that “Not more than 62 per cent 
of the upper level male’s outlet is derived from marital intercourse by the age 
of 55.” On checking back to Table 85, p. 348, we find that there were only 81 
upper-level married men above the age of 45 years for whom data on source of 
outlet are given. From Table 56, p. 252, we find that there were only 109 
married men in the total population (all educational levels combined) of ages 
51-55, and only 67 above the age of 55. Surely bold type is hardly suitable for 
sweeping conclusions based on such limited populations. 

2. “Of all religious groups they [the orthodox Jewish males] are the sexually 
least active, both in regard to the frequencies of their total sexual outlet, and in 
regard to the incidences and frequencies of masturbation, nocturnal emissions 
and the homosexual” (p. 485). The author attributes the relatively low sexual 
activity in this group to “the pervading asceticism of Hebrew philosophy” (p. 
486), and in other passages he blames this ancient Jewish asceticism for the 
unrealistic severity with which most of the Christian peoples condemn depar- 
tures from the Talmudic ideals. Probably very few readers will examine 
Kinsey’s tables closely enough to discover that this interpretation is based on 
an N of only 59 orthodox Jews in the entire U.S., all of college level! 

3. “Among the males who remain unmarried until the age of 35, almost 
exactly 50 per cent have homosexual experience between the beginning of 
adolescence and that age” (p. 623). This statement has been quoted by 
reviewers without any question of its validity; they have not taken the trouble to 
find out that it is based on an N of 68 for the 0-8 educational level, less than 50 
for the 9-12 level, and 71 for the 13+ level (Tables 141, 142, 143). 

4. “The condemnation of petting on the ground that it may lead to something 
that is worse is quite unfounded, for there is no evidence that the frequency 
of premarital intercourse has increased during recent generations...” 
(p. 541). As we have previously shown, only for the males of college education 
is Kinsey’s statement approximately true, and even here the incidence figures 
for premarital intercourse run appreciably higher for the younger group from 
age 19 through 24. 

5. On p. 392 it is stated that “The persons involved in these [various illicit] 
activities, taken as a whole, constitute more than 95 percent of the total male 
population. Only a relatively small proportion of the males who are sent to 
penal institutions for sex offenses have been involved in behavior which is 
materially different from the behavior of most of the males in the population.” (Italics 
by the reviewer.) Even if the first of these statements were true, there would 
still be reason to doubt the validity of the second. It is as though one said that 
if 95 percent of all males have at some time in their lives stolen something, those 
who are sent to penal institutions for theft or burglary are not materially dif- 
ferent from most males in the population. 

6. On p. 387, speaking of the eighth grade teacher’s violent reaction on 
discovering that a boy in her class has had intercourse with one of the girls, Kinsey 
says: “The teacher does not realize that more than a fourth (28%) of all her 
other eighth grade boys have similarly had intercourse.” This statement is 
misleading. Table 136, which he cites in support of his statement, merely tells us 
that of about 680 adult males of all ages (and of unspecified origin) who did not 
go beyond the eighth grade, 28 percent admitted having had intercourse before 
the age of 15 years. Again one would like to know who these adult males were 
who had only 0-8 grades of schooling. 
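
Several of the examples above turn on small N’s. A rough sketch of the sampling error involved is given below. It assumes simple random sampling within each stratum (which the reviewer doubts) and, purely for illustration, takes the reported 50 per cent figure as the proportion in each stratum; the function name and the normal-approximation margin are assumptions, not anything given in the book or the review.

```python
import math

def proportion_margin(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p observed in n cases."""
    return z * math.sqrt(p * (1.0 - p) / n)

# (label, assumed proportion, N cited in the review for example 3)
cases = [
    ("unmarried to 35, 0-8 level", 0.50, 68),
    ("unmarried to 35, 13+ level", 0.50, 71),
]

margins = {label: proportion_margin(p, n) for label, p, n in cases}
for label, margin in margins.items():
    print(f"{label}: margin of about {100 * margin:.0f} percentage points")
```

Under these assumptions a figure reported as "almost exactly 50 per cent" carries an uncertainty of more than ten percentage points in either direction, before any question of representativeness is raised.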


Judgments of evaluation or interpretation. Although Kinsey has more 
than once asserted that his job is to discover and to report facts, he nev- 
ertheless does not hesitate to express judgments of evaluation and inter- 
pretation for which no data, or only inadequate data, are given. Such 
judgments often appear to be based upon nothing more than vague im- 
pressions or intuitions. Some of them are closely akin to moral evalua- 
tions, although Kinsey time and again disavows any intention to pass 
moral judgment or any competency to do so. 


1. On p. 211 the author passes severe judgment on the 57 psychologists and 
70 psychiatrists who are said to be “prejudiced” against masturbation, 
homosexuality, and extra marital intercourse. “These [prejudices] are all 
rationalizations, clutched at in support of a sexual suppression that is too often taken for 
sublimation.” 

2. Regarding the 179 males of lowest frequency in his population, the author 
says (p. 209) that 52.5 percent of them were “apathetic,” and that they “never, 
at any time in their histories, have given evidence that they were capable of 
anything except low rates of activity.” On p. 211 we are told that 58 percent 
of these 179 cases were “timid or inhibited individuals—afraid of their own self 
condemnation if they were to engage in almost any sort of sexual activity.” 
The author adds that “some of these individuals become paranoid in their fear 
of moral transgression, or its outcome,” and that 9 of them had attempted 
suicide. These alleged effects of sexual restraint, presented as if they were 
typical, could well frighten the too chaste youth into greater sexual activity. 

3. We read on p. 525 that: “On the whole, the males who are most dependent 
upon nocturnal emissions are those who are slow in developing physically, 
those who are slow in their nervous reactions or unresponsive to the usual sexual 
stimuli, or those who are timid and awkward in making social contacts. They 
are the males who are most often restrained for moral reasons. There are some 
outstanding exceptions to this, proving that a multiplicity of factors may be 
involved in determining the frequencies of nocturnal emissions; but, by and 
large, emissions are most often depended upon by the male who has not made what 
the psychiatrist would call a good socio-sexual adjustment.” (Italics by the re- 
viewer.) Here, obviously, Kinsey is reporting a personal impression which may 
or may not accord with the facts, and the language he uses suggests that a 
youth who wants to escape socio-sexual maladjustment would do well to sup- 
plement his nocturnal emissions by some other kind of outlet—presumably 
sexual intercourse, since masturbation usually lacks the social element. 

4. On p. 383 Kinsey says: “As a matter of fact, a male who has been so 
restrained [he is talking about the college-educated male] often has difficulty in 
working out a sexual adjustment with his wife, and it is doubtful whether very 
many of the upper level males would have any facility in finding extra marital 
intercourse, even if they were to set out deliberately after it.” (Italics by the 
reviewer.) However, the data given in Table 85, p. 348, show that it is precisely 
the upper level male who, as he gets older, finds a larger and larger proportion 
of his intercourse outside of marriage, whereas the opposite is true of the 0-8 
and 9-12 education levels. 

5. “For the boys who have not been too disturbed psychically, masturbation 
has, however, provided a regular sexual outlet which has alleviated nervous 
tensions; and the record is clear in many cases that these boys have on the 
whole lived more balanced lives than the boys who have been more restrained 
in their sexual activities” (p. 514). The judgment here about “balanced lives” 
is primarily psychological or psychiatrical, but it is doubtful whether many 
psychologists or psychiatrists would render so positive a judgment, pro or con, 
on this issue after a single interview. 

6. On p. 661 the author, apropos of an individual’s preference for a sexual 
partner of the same or opposite sex, says that “This problem is, after all, part 
of the broader problem of choices in general: the choice of the road that one 
takes, of the clothes that one wears, of the food that one eats, of the place in 
which one sleeps, and of the endless other things that one is constantly 
choosing.” That it is a problem of choosing is evident enough, but to many a reader 
the implication of the passage will be that the sex chosen as partner in a sexual 
activity is as unimportant as one’s preferences regarding food or the cut of one’s 
clothes. 


At this point the reader may protest that it is unfair to quote these 
passages out of their context. So far as the individual passages are con- 
cerned, the reviewer does not think that their meanings have been seri- 
ously distorted. In one respect, however, the list of quotations is un- 
fair; namely, in the fact that passages scattered through several hun- 
dred pages of the book are here brought together and laid end to end. 
Admittedly they are not random samples of the author’s generalizations 
and evaluations—no man of science would be consistently so reckless in 
his use of language. Nevertheless, the passages quoted are just the ones 
that are most likely to impress the youth who is in search of authorita- 
tive justification for the unrestrained satisfaction of his sexual urges. 


SUMMARY 


Insufficient information has been given about the specific questions 
asked in the interview, and about the precautions taken to reduce cover- 
up and to eliminate the possible effects of suggestion. For this reason it 
would be difficult if not impossible for anyone to repeat the experiment 
and obtain comparable results. 

The author has failed to give adequate information about the sources 
from which he obtained the various segments of the total population 
interviewed. From the facts given, we are not able to judge the repre- 
sentativeness of any of the sub-groups; the rural, the urban, the oc- 
cupational classes, the 0-8 or the 9-12 educational levels, or the different 
religious classes. It is especially regrettable that no definite information 
has been given about the number of subjects among the 5,300 males 
interviewed who were obtained from penal or mental institutions, from 
underworld sources, or from homosexual groups. We are not told the 
proportion of divorces to marriages among those in the 5,300 who were 
or had been married. We are not even given the complete age distribu- 
tion of subjects at the time they were interviewed. 

From McNemar’s criticisms of the sample-size experiment it is evi- 
dent that Kinsey needed more statistical guidance than he has had. It 
is a pity that the 40 pages wasted on this experiment were not devoted 
to specific information about the nature and source of the individual 
groups that made up the total population. 

The statistical “correction” of the found data for the purpose of 
showing incidences and frequencies for the total U. S. population of 
given age, educational level, or other class is a dubious procedure at the 
present stage of the investigation, dubious both because of the small 
N’s in many sub-groups and because of their doubtfully representative 
nature. 

No distinction has been made between reports based upon remote 
recall and reports of current or near-current activities. The two kinds 
of data have been thrown together and given equal weight. Nowhere 
are we told what proportions of the data for a given age are based upon 
remote recall and upon report of recent events. This procedure increases 
the N’s for given activities, but at what cost in reduced validity of data 
no one can say. 

In his text Kinsey has made many sweeping factual statements 
which are only weakly supported by the statistical tables, or which, in 
certain instances, are contradicted by the tables. Other statements are 
made as though they were factual when apparently they are based 
upon nothing more than personal impressions. 

Notwithstanding Kinsey’s frequent reiteration that his job is to 
report facts rather than evaluations, it has been possible to quote nu- 
merous passages in which recklessly worded and slanted evaluations are 
expressed, the slanting being often in the direction of implied preference 
for uninhibited sexual activity. 

The reviewer fully agrees with Kinsey that the facts about human 
sexual behavior should be brought to light, and he regards this investi- 
gation as so important that he sincerely hopes it can be carried through 
to completion. But he also hopes that the many faults of exposition and 
interpretation to be found in this volume will not be repeated in those 
which follow. 

Assessment of men: selection of personnel for the Office of Strategic Services. 
The OSS Assessment Staff. New York: Rinehart, 1948. Pp. 
xv+541. 


The development of the Office of Strategic Services covered the ini- 
tiation and implementation of the intricate, highly diverse, full-scale 
type of intelligence service necessary for a modern nation at war. As 
with the other military services, it developed that the ordinary recruit- 
ing procedures were not sufficiently selective and needed supplementa- 
tion by some program of personality assessment if efficient personnel 
were to be obtained. As a result an assessment staff was instituted late 
in 1943. This book sets forth clearly and in great detail the selection 
procedures which that staff utilized. It is well written, effectively il- 
lustrated, and presented in pleasing format. The program set forth is 
“organismic” rather than “elementalistic,” designed to reveal the dynamic 
organization of the total personality. As such it relies strongly on 
observation of the individual performing under stress in group situa- 
tions, and derives directly from the methods used by Simoneit in Ger- 
many and the War Office Selection Board program for officer candidates 
in the British Army. 

The book not only describes these procedures but attempts a justifi- 
cation of their use. Its description is eminently successful, its justifica- 
tion is less so. In reading the book one must carefully set aside the ques- 
tion of whether these assessment measures sound interesting, stimulating, 
and revelatory of personality, and ask instead whether their successful 
application to a special selection task has been objectively demonstrated 
by the data set forth. There is no final answer in this book. 

The assessment program was instituted without provision for any 
adequate criteria against which its operating efficiency could be evalu- 
ated. Such criterion data as were gathered later were fragmentary, im- 
pure, and of little scientific value. The force of this criticism is some- 
what disarmed, however, by the complete honesty and candor with 
which the authors recognize and stress this weakness. 

For this reason, the critical reader will find so many points that he 
might wish to attack that a careful discussion of them all becomes im- 
possible within the confines of a book review. One problem may serve as 
an illustration. In justifying their assessment techniques the authors 
compare the rate of neuropsychiatric breakdowns among unassessed as 
against assessed personnel. They state that the precise total of OSS personnel 
was never determined, and use as a base figure a number “which 
is probably correct within 10 per cent.” No mention is made of the facilities 
for recognizing neuropsychiatric conditions or of the availability 
of channels for their disposal, although both these factors influence the 
validity of any recorded figures for incidence. Moreover, one assumes 
that the assessed group is temporally a later group, and we know that 
length of service is a factor in producing maladjustment. Finally, we 
must assume that since the initiation of a screening program came after 
the recognition of a personnel problem, general recruiting procedures 
also were improved and the group available for assessment must have 
been more highly selected than the earlier recruits. 

Two general criticisms are valid and may be raised in the form of 
questions: 


1. Was a determined effort made and genuinely fought through administra- 
tive channels to provide for evaluative criteria at the initiation of the program? 

2. In view of the great reliance upon the personal evaluative judgments of 
the individual staff members, what criteria were set up for participation in the 
program, and what running checks, if any, were made upon the efficiency of 
staff personnel? 


The total impression of the OSS assessment program that one gets 
from this book is that it was instituted under an arbitrarily selected 
philosophy of assessment, conducted by individuals enthusiastically 
committed to this point of view, and sustained by faith rather than con- 
crete evaluative data. This is not to say that it was not a good program, 
nor that a better one could have been developed. It is merely to wish 
that belief were reinforced by scientific data. The reader probably will 
agree when the authors say “... something definitely useful has come 
out of these expert labors. But it is not possible to say that they have 
added anything to our knowledge of the components and determinants 
of personality.” He would be happier, however, had he been presented 
with more adequate data upon which to base his judgment. 

WILLIAM A. HUNT. 

Northwestern University. 


Tomkins, S.S. The Thematic Apperception Test. New York: Grune & 
Stratton, 1947. Pp. vii+291. 


Clinical psychologists have for some time anticipated an authorita- 
tive, book-length discussion of the Thematic Apperception Test to 
bring order in the chaos of minor research and diverse treatment of the- 
matic material following Murray’s original presentation. Tomkins’ 
book is the work of a philosopher and scholar, as well as an able clinician. 
It contributes an original approach to the problems of scoring, inter- 
pretation and research, based upon a rationale which considers the in- 
strument in the larger setting of scientific methodology with special 
reference to the science of personality. It is presented as “a workbook, 
rather than as a compilation of established doctrine” and offers a sys- 
tem of interpretation which in many respects is essentially new. Per- 
haps a peculiarity of the TAT is the fact that no system so far presented 
has gained general use, and it seems unlikely that this contribution will 
supplant eclecticism, or replace more flexible approaches of the type set 
forth by Rapaport. The scoring is based upon a set of variables (“vectors,” 
“levels,” “conditioners,” and “qualifiers”) which, in the hands of 
anyone but their originator, are likely to prove unduly labored. 

Of particular interest from the standpoint of methodology are the 
opening chapters which undertake an analysis of the analytic process 
engaged in by the interpreter, for the purpose of demonstrating that 
“The interpretation of TAT stories may employ canons of inference long 
accepted by other sciences.” This examination of the logic of deduction 
is a valuable corrective to our sometimes myopic research for substantiation 
of hypotheses in a complex matrix of validating criteria. Nevertheless, 
some students of personality are likely to find in Tomkins an 
over-emphasis upon reductionistic, cause-effect analysis, impressive in 
its grammarian thoroughness, but likely to obscure, especially for be- 
ginners, less abstract and more illuminating interdependencies in the 
data. The approach is from microscopic to macroscopic analysis, 
whereas the reverse is the method of preference for many. 

The practicing clinician is likely to find greatest satisfaction in the 
later chapters in which, after the author has described his tools, he 
demonstrates their use. The sections on personality diagnosis, in 
which the various regions of family, love and sex, social relationships, 
work and vocational settings are discussed in significant dimensions, 
are rich in case material, with interpretation masterfully handled, and 
new diagnostic insights supplied. 

In the writer’s opinion, a highly important contribution is Tomkins’ 
discussion of repression. Here he takes issue with orthodox psychoanalytic 
conceptions regarding the pathogenic nature of “deep” repression, 
enunciating a theory by which the seriousness of conflict is judged not by 
depth of repression, but by the intensity and extensity of conflicting 
wishes and the degree of deadlock between them. This concept has im- 
port for therapy and for theory of personality, as well as for the prog- 
nostic potentialities of the TAT. 

The book opens with a useful summary of important research in the 
Thematic Apperception Test and closes with a discussion of the test as 
an adjunct to therapy. Although it is unlikely to fill the need for an in- 
tegrative text in the field, it marks the coming of age of an important 
psychodiagnostic device, and by reason of its thorough exposition and 
systematic contribution deserves to become an essential reference in 
TAT research development. 

HELEN SARGENT. 

Northwestern University. 